\name{Cross-Validation for SES and MMPC}
\alias{cv.ses}
\alias{cv.mmpc}
\alias{cv.waldses}
\alias{cv.waldmmpc}
\alias{cv.permses}
\alias{cv.permmmpc}
\alias{auc.mxm}
\alias{acc.mxm}
\alias{ord_mae.mxm}
\alias{mae.mxm}
\alias{mse.mxm}
\alias{pve.mxm}
\alias{ci.mxm}
\alias{ciwr.mxm}
\alias{fscore.mxm}
\alias{euclid_sens.spec.mxm}
\alias{glm.mxm}
\alias{lm.mxm}
\alias{rq.mxm}
\alias{acc_multinom.mxm}
\alias{lmrob.mxm}
\alias{weibreg.mxm}
\alias{coxph.mxm}
\alias{poisdev.mxm}
\alias{nbdev.mxm}
\alias{pois.mxm}
\alias{nb.mxm}
\alias{multinom.mxm}
\alias{ordinal.mxm}
\alias{beta.mxm}

\title{
Cross-Validation for SES and MMPC
}

\description{
The function performs a k-fold cross-validation for identifying the best values for the SES and MMPC 'max_k' and 'threshold' hyper-parameters.
}

\usage{
cv.ses(target, dataset, wei = NULL, kfolds = 10, folds = NULL, 
alphas = c(0.1, 0.05, 0.01), max_ks = c(3, 2), task = NULL, 
metric = NULL, metricbbc = NULL, modeler = NULL, ses_test = NULL, 
ncores = 1, B = 1)

cv.mmpc(target, dataset, wei = NULL, kfolds = 10, folds = NULL, 
alphas = c(0.1, 0.05, 0.01), max_ks = c(3, 2), task = NULL, 
metric = NULL, metricbbc = NULL, modeler = NULL, mmpc_test = NULL, 
ncores = 1, B = 1)

cv.waldses(target, dataset, wei = NULL, kfolds = 10, folds = NULL, 
alphas = c(0.1, 0.05, 0.01), max_ks = c(3, 2), task = NULL, 
metric = NULL, metricbbc = NULL, modeler = NULL, ses_test = NULL,
ncores = 1, B = 1)

cv.waldmmpc(target, dataset, wei = NULL, kfolds = 10, folds = NULL, 
alphas = c(0.1, 0.05, 0.01), max_ks = c(3, 2), task = NULL, 
metric = NULL, metricbbc = NULL, modeler = NULL, mmpc_test = NULL, 
ncores = 1, B = 1)

cv.permses(target, dataset, wei = NULL, kfolds = 10, folds = NULL, 
alphas = c(0.1, 0.05, 0.01), max_ks = c(3, 2), task = NULL, 
metric = NULL, metricbbc = NULL, modeler = NULL, ses_test = NULL, R = 999, 
ncores = 1, B = 1)

cv.permmmpc(target, dataset, wei = NULL, kfolds = 10, folds = NULL, 
alphas = c(0.1, 0.05, 0.01), max_ks = c(3, 2), task = NULL, 
metric = NULL, metricbbc = NULL, modeler = NULL, mmpc_test = NULL, R = 999, 
ncores = 1, B = 1)
}

\arguments{
\item{target}{
The target or class variable, as in SES and MMPC. Unlike in those functions, it cannot be a single integer indicating the column of the target in the dataset; the values themselves must be supplied.
}
\item{dataset}{
The dataset object as in SES and MMPC.
}
\item{wei}{
A vector of weights to be used for weighted regression. The default value is NULL. 
}
\item{kfolds}{
The number of the folds in the k-fold Cross Validation (integer).
}
\item{folds}{
The folds of the data to use (a list generated by the function generateCVRuns from the TunePareto package). If NULL, the folds are created internally with the same function.
}
\item{alphas}{
A vector of threshold (significance level) values for SES or MMPC to be tried in the cross-validation.
}
\item{max_ks}{
A vector of max_k values for SES or MMPC to be tried in the cross-validation.
}
\item{task}{
A character ("C", "R" or "S"). It can be "C" for classification (logistic, multinomial or ordinal regression), "R" for regression (robust and non robust linear regression, median regression, 
(zero inflated) Poisson and negative binomial regression, beta regression) or "S" for survival regression (Cox, Weibull or exponential regression).
}
\item{metric}{
A metric function provided by the user. If NULL, the following functions will be used: auc.mxm, mse.mxm and ci.mxm for classification, regression and survival analysis tasks, respectively. See details for more.
If you know which metric you want, supply it here to prevent the function from choosing something else. \bold{Note} that the function name is passed as is, without quotation marks.
}
\item{metricbbc}{
This is the same argument as "metric", with the difference that the name must be given as a character string, in quotation marks. For example, if metric = auc.mxm, then metricbbc = "auc.mxm". The same metric must be given in both arguments. This argument is to be used with the function \code{\link{bbc}}, which does bootstrap bias correction of the estimated performance (Tsamardinos, Greasidou and Borboudakis, 2018). It is used only if the last argument (B) is greater than 1.
}
\item{modeler}{
A modelling function provided by the user. If NULL, the following functions will be used: glm.mxm, lm.mxm and coxph.mxm for classification, regression and survival analysis tasks, respectively. See details for more.
If you know which modelling function you want, supply it here to prevent the function from choosing something else. \bold{Note} that the function name is passed as is, without quotation marks.
}
\item{ses_test}{
A function object that defines the conditional independence test used in the SES function (see also SES help page). If NULL, "testIndLogistic", "testIndFisher" and "censIndCR" 
are used for classification, regression and survival analysis tasks, respectively. If you know which test you want, supply it here to prevent the function from choosing something else. 
Not all tests can be included here. "testIndClogit", "testIndMVreg", "testIndIG", "testIndGamma", "testIndZIP" and "testIndTobit" are not available at the moment.
}
\item{mmpc_test}{
A function object that defines the conditional independence test used in the MMPC function (see also SES help page). If NULL, "testIndLogistic", "testIndFisher" and "censIndCR" are used for classification, 
regression and survival analysis tasks, respectively.
}
\item{R}{
The number of permutations, set to 999 by default. There is a trick to avoid performing all permutations. As soon as the number of times the permuted test statistic exceeds the observed test statistic is more 
than 50 (if threshold = 0.05 and R = 999), the p-value has exceeded the significance level (threshold value) and hence the predictor variable is not significant. There is no need to perform the remaining permutations, 
as a decision has already been made.
}
\item{ncores}{
The number of cores to use. This argument is valid only if you have a multi-threaded machine.
}
\item{B}{
How many bootstrap re-samples to draw. This argument is to be used with the function \code{\link{bbc}}, which does bootstrap bias correction of the estimated performance (Tsamardinos, Greasidou and Borboudakis, 2018). If you have thousands of samples (observations), this might not be necessary, as there is no optimistic bias to be corrected. The sample size above which the correction becomes unnecessary cannot, however, be known beforehand. SES and MMPC were designed for low sample size cases; hence, bootstrap bias correction is probably advisable.
}
}

\details{

Input for metric functions:
\itemize{
\item predictions: A vector of predictions to be tested.
\item test_target: The actual values of the target variable, to be compared with the predictions.
}

The output of a metric function is a single numeric value. \bold{Higher values indicate better performance}. Metrics based on error measures should be modified accordingly (e.g., by multiplying the error by -1).

The metric functions that are currently supported are:
\itemize{
\item auc.mxm: "area under the receiver operating characteristic curve" metric for binary logistic regression.
\item acc.mxm: accuracy for binary logistic regression.
\item fscore.mxm: F score for binary logistic regression.
\item euclid_sens.spec.mxm: Euclidean norm of 1 - sensitivity and 1 - specificity for binary logistic regression.
\item acc_multinom.mxm: accuracy for multinomial logistic regression.
\item mse.mxm: mean squared error, for robust and non robust linear regression and median (quantile) regression (multiplied by -1). 
\item pve.mxm: 1 - (mean squared error)/( (n - 1) * var(y_out) ), for non robust linear regression. It is basically the proportion of variance explained in the test set. 
\item ci.mxm: 1 - concordance index as provided in the rcorr.cens function from the Hmisc package. This is to be used with the Cox proportional hazards model only.
\item ciwr.mxm: concordance index as provided in the rcorr.cens function from the Hmisc package. This is to be used with the Weibull regression model only.
\item poisdev.mxm: Poisson regression deviance (multiplied by -1).
\item nbdev.mxm: Negative binomial regression deviance (multiplied by -1). 
\item binomdev.mxm: Binomial regression deviance (multiplied by -1). 
\item ord_mae.mxm: Ordinal regression mean absolute error (multiplied by -1). 
\item mae.mxm: Mean absolute error (multiplied by -1). 
}
Usage: metric(predictions, test_target)
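
As a hedged illustration of this interface, a user-supplied metric could look like the sketch below. The name \code{neg_rmse.sketch} is hypothetical and not part of the package; the sketch assumes only the two documented inputs and returns the negated root mean squared error, so that higher values indicate better performance.

\preformatted{
# Hypothetical user-defined metric (not part of the package).
# Assumes only the documented inputs: predictions and test_target.
neg_rmse.sketch <- function(predictions, test_target) {
  # negate the error so that higher values indicate better performance
  - sqrt( mean( (predictions - test_target)^2 ) )
}

# toy call: one prediction is off by 2, the other two are exact
neg_rmse.sketch( c(1, 2, 3), c(1, 2, 5) )
}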


Input of modelling functions:
\itemize{
\item train_target: The target variable used in the training procedure.
\item sign_data: The training set.
\item sign_test: The test set.
}

Modelling functions return a single vector of predictions, obtained by applying the model fitted on sign_data and train_target to sign_test.

The modelling functions that are currently supported are:
\itemize{
  \item glm.mxm: fits a glm for a binomial family (classification task).
  \item multinom.mxm: fits a multinomial regression model (classification task).
  \item lm.mxm: fits a linear model (regression task).
  \item coxph.mxm: fits a Cox proportional hazards regression model (survival task).
  \item weibreg.mxm: fits a Weibull regression model (survival task).
  \item rq.mxm: fits a quantile (median) regression model (regression task).
  \item lmrob.mxm: fits a robust linear model (regression task).
  \item pois.mxm: fits a Poisson regression model (regression task).  
  \item nb.mxm: fits a negative binomial regression model (regression task).  
  \item ordinal.mxm: fits an ordinal regression model (classification task).
  \item beta.mxm: fits a beta regression model (regression task). The predicted values are mapped onto the real line \eqn{R} using
        the logit transformation. This is so that the "mse.mxm" metric function can be used. In addition, this way the performance can be  
        compared with the regression scenario, where the logit is applied to the target and then a regression model is employed. 

}
Usage: modeler(train_target, sign_data, sign_test)
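
A minimal sketch of a user-supplied modelling function with this interface is given below. The name \code{lm.sketch} is hypothetical, and the sketch assumes that sign_data and sign_test are numeric matrices (or data frames) with the same columns. It only illustrates the documented inputs and output and is not the package's own implementation.

\preformatted{
# Hypothetical user-defined modeler (not part of the package).
lm.sketch <- function(train_target, sign_data, sign_test) {
  train <- as.data.frame(sign_data)
  test <- as.data.frame(sign_test)
  # make sure both sets carry identical column names
  colnames(test) <- colnames(train)
  # fit a linear model on the training set ...
  mod <- lm(train_target ~ ., data = train)
  # ... and return a single vector of predictions for the test set
  as.vector( predict(mod, newdata = test) )
}

# toy data: 40 training and 20 test observations
set.seed(1)
x <- matrix( rnorm(60 * 3), ncol = 3 )
y <- x[, 1] + rnorm(60)
preds <- lm.sketch( y[1:40], x[1:40, ], x[41:60, ] )
length(preds)
}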

The procedure will be more automated in the future and more functions will be added. 
The multithreaded functions have been tested and no error has been detected. However, if you spot any suspicious results please let us know. 

}

\value{
A list including:
\item{cv_results_all}{
A list with predictions, performances and signatures for each fold and each SES or MMPC configuration (e.g. cv_results_all[[3]]$performances[1] indicates the performance of the 1st fold with the 
3rd configuration of SES or MMPC). In the case of the multi-threaded functions (cvses.par and cvmmpc.par) this is a list with a matrix. The rows correspond to the folds and the columns to the 
configurations (pairs of threshold and max_k).
}
\item{best_performance}{
A numeric value that represents the best average performance.
}
\item{best_configuration}{
A list that corresponds to the best configuration of SES or MMPC including id, threshold (named 'a') and max_k.
}
\item{bbc_best_performance}{
The bootstrap bias corrected best performance if B was more than 1, otherwise this is NULL.
}
\item{runtime}{
The runtime of the cross-validation procedure.
}

Bear in mind that the values can be extracted with the $ symbol, i.e. this is an S3 class output. 
}

\references{
Ioannis Tsamardinos, Elissavet Greasidou and Giorgos Borboudakis (2018). Bootstrapping the out-of-sample predictions for efficient and accurate cross-validation. Machine Learning (To appear).  

https://link.springer.com/article/10.1007/s10994-018-5714-4
}

\author{
R implementation and documentation: Giorgos Athineou <athineou@csd.uoc.gr> and Vincenzo Lagani <vlagani@csd.uoc.gr>
}


\seealso{
\code{\link{SES}, \link{CondIndTests}, \link{cv.gomp}, \link{bbc}, \link{testIndFisher}, \link{testIndLogistic}, \link{gSquare}, \link{censIndCR}}
}

\examples{
set.seed(1234)

# simulate a dataset with continuous data
dataset <- matrix( rnorm(200 * 50), ncol = 50 )
# the target feature is the last column of the dataset as a vector
target <- dataset[, 50]
dataset <- dataset[, -50]

# get 50 percent of the dataset as a train set
train_set <- dataset[1:100, ]
train_target <- target[1:100]

require(hash)
# run a 5-fold CV for the regression task
best_model <- cv.ses(target = train_target, dataset = train_set, kfolds = 5, task = "R")

# get the results
best_model$best_configuration
best_model$best_performance

# summary elements of the process. Press tab after each $ to view all the elements and
# choose the one you are interested in.
# best_model$cv_results_all[[...]]$...
# i.e.
# mse value for the 1st configuration of SES in the 5th fold
abs(best_model$cv_results_all[[1]]$performances[5])

best_a <- best_model$best_configuration$a
best_max_k <- best_model$best_configuration$max_k
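
# A hedged sketch of the early stopping rule described for the argument R.
# This is not the package's internal code: perm_cor_sketch is a hypothetical
# helper that uses the absolute correlation as the test statistic and assumes
# the permutation p-value is computed as (exceed + 1) / (R + 1).
perm_cor_sketch <- function(x, y, R = 999, threshold = 0.05) {
  obs <- abs( cor(x, y) )
  # once 'exceed' reaches this bound the p-value is above the threshold
  # (the bound equals 50 for R = 999 and threshold = 0.05)
  bound <- floor( threshold * (R + 1) )
  exceed <- 0
  done <- 0
  while ( done < R  &&  exceed < bound ) {
    done <- done + 1
    # count permutations whose statistic is at least as large as the observed
    if ( abs( cor( x, sample(y) ) ) >= obs )  exceed <- exceed + 1
  }
  list( significant = exceed < bound, permutations = done )
}

set.seed(1)
x_sk <- rnorm(100)
# predictor unrelated to the response: the loop usually stops well before R
res_null <- perm_cor_sketch( x_sk, rnorm(100) )
res_null$permutations
# strongly related response: all R permutations are needed to decide
res_dep <- perm_cor_sketch( x_sk, x_sk + rnorm(100, sd = 0.5) )
res_dep$significant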
}

\keyword{ Cross validation}
\keyword{ SES }
\keyword{ MMPC }
\keyword{ parallel }

