% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/fast_cvpvi.R
\name{fast_cvpvi}
\alias{fast_cvpvi}
\title{Fast cross-validated permutation variable importance (ranger-based)}
\usage{
fast_cvpvi(
  X,
  Y,
  k = 5,
  ntree = 500,
  nbf = 0,
  nthreads = max(1L, parallel::detectCores() - 1L),
  folds_parallel = c("auto", "TRUE", "FALSE"),
  mtry = NULL,
  sample_fraction = 1,
  min_node_size = 1L,
  seed = 123
)
}
\arguments{
\item{X}{Numeric matrix (n x p); samples in rows, features in columns. Column
names should be feature IDs (e.g., m/z). Non-finite values are set to zero
internally for modeling.}

\item{Y}{Factor or numeric response of length n. A factor triggers
classification; numeric triggers regression.}

\item{k}{Integer; number of cross-validation folds. Default 5.}

\item{ntree}{Integer; number of trees per fold model. Default 500.}

\item{nbf}{Integer (>= 0); number of artificial “false” (noise) features to
append to X for estimating the null distribution of importances. Default 0
disables this (the null is then approximated using mirrored negative
importances).}

\item{nthreads}{Integer; total threads available. When parallelizing folds,
each fold worker gets one ranger thread to avoid oversubscription; when not
parallelizing folds, ranger uses up to \code{nthreads} threads. Default is
\code{max(1L, parallel::detectCores() - 1L)}.}

\item{folds_parallel}{Character; "auto", "TRUE", or "FALSE".
\itemize{
\item "auto": parallelize across folds when k > 1 and nthreads >= 4 (default).
\item "TRUE": force fold-level parallelism (PSOCK cluster).
\item "FALSE": evaluate folds sequentially (ranger can then use multiple threads).
}}

\item{mtry}{Optional integer; variables tried at each split. If NULL, defaults
to floor(sqrt(p)) for classification or max(floor(p/3), 1) for regression.}

\item{sample_fraction}{Numeric in (0, 1]; fraction of samples drawn for each
tree (a speed/regularization knob). Default 1.}

\item{min_node_size}{Integer; ranger minimum node size. Larger values speed up
training and yield smaller trees. Default 1.}

\item{seed}{Integer; RNG seed. Default 123.}
}
\value{
A list with:
\itemize{
\item nb_to_sel: integer; number of selected features (floor(p * (1 - pi0))).
\item sel_moz: character vector of selected feature names (columns of X).
\item imp_sel: named numeric vector of CV importances for selected features.
\item fold_varim: matrix (features x folds) of per-fold permutation importances.
\item cv_varim: matrix (features x 1) of averaged importances across folds.
\item pi0: estimated proportion of null features.
}
}
\description{
Computes cross-validated permutation variable importance (PVI) using the
ranger random-forest implementation. For each CV fold, a ranger model is
trained on the training split and out-of-bag (OOB) permutation importance is
computed by ranger in C++. Importances are averaged across folds to obtain a
stable CV importance vector. Optionally, artificial “false” features are
appended to help estimate the null distribution and pi0; the top (1 - pi0)
proportion of features is then selected. Folds can be evaluated in parallel
(Windows-safe PSOCK cluster) while avoiding CPU oversubscription.
}
\details{
\itemize{
\item One ranger model is trained per fold, on that fold's training split.
Permutation importance (\code{importance = "permutation"}) is computed in C++
on the out-of-bag (OOB) samples. The per-fold importances are averaged to
obtain the CV importances.
\item Null and pi0: if \code{nbf > 0}, \code{nbf} noise features ("false
peaks", drawn uniformly between \code{min(X)} and \code{max(X)}) are appended
to \code{X}, and the negative importances among them help shape the null
distribution. If \code{nbf = 0}, the null is approximated by mirroring the
negative importances of the true features. pi0 is then obtained from an
estimator of the proportion of useless features, evaluated over high
quantiles. If no negative importances occur, pi0 is set to 0 (conservative).
\item Parallelism: with \code{folds_parallel = "TRUE"}, or with \code{"auto"}
when its condition is met, folds run in parallel on a PSOCK cluster
(Windows-safe). Each worker sets ranger's \code{num.threads = 1} to avoid
oversubscription. With \code{"FALSE"}, folds run sequentially and ranger uses
up to \code{nthreads} threads, which can be faster for small k or very large p.
}
}
\examples{
\dontrun{
set.seed(1)
n <- 120; p <- 200
X <- matrix(rnorm(n * p), n, p)
colnames(X) <- paste0("mz_", seq_len(p))
Y <- factor(sample(letters[1:3], n, replace = TRUE))

if (requireNamespace("ranger", quietly = TRUE)) {
  out <- fast_cvpvi(
    X, Y,
    k = 5,
    ntree = 300,
    nbf = 50,
    nthreads = max(1L, parallel::detectCores() - 1L),
    folds_parallel = "auto",
    seed = 42
  )
  head(out$sel_moz)
  # CV importances for top features
  head(sort(out$cv_varim[,1], decreasing = TRUE))
}
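
# Illustrative follow-up (a sketch, not part of the original example): rerun
# with sequential folds, so that ranger itself may use several threads, and
# inspect the other elements documented under Value. The object name out_seq
# and the settings k = 3 and nthreads = 2 are arbitrary choices made here.
if (requireNamespace("ranger", quietly = TRUE)) {
  out_seq <- fast_cvpvi(
    X, Y,
    k = 3,
    ntree = 300,
    nbf = 50,
    nthreads = 2,
    folds_parallel = "FALSE",
    seed = 42
  )
  print(out_seq$pi0)              # estimated proportion of null features
  print(out_seq$nb_to_sel)        # number of selected features
  print(dim(out_seq$fold_varim))  # features x folds
  print(head(out_seq$imp_sel))    # CV importances of the selected features
}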
}
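
# Illustrative sketch only (base R, not the package's internal code): it
# reproduces by hand three rules stated on this help page. All names ending
# in _demo are hypothetical and exist only for this illustration.
set.seed(1)
X_demo <- matrix(rnorm(20 * 9), nrow = 20, ncol = 9)
nbf_demo <- 3L

# Noise ("false") features: uniform draws between min(X) and max(X).
noise_demo <- matrix(
  runif(nrow(X_demo) * nbf_demo, min = min(X_demo), max = max(X_demo)),
  nrow = nrow(X_demo), ncol = nbf_demo
)
X_aug_demo <- cbind(X_demo, noise_demo)
dim(X_aug_demo)

# Default mtry as documented. Taking p as ncol(X_demo) is an assumption: the
# page does not say whether appended noise columns are counted.
p_demo <- ncol(X_demo)
mtry_class_demo <- floor(sqrt(p_demo))      # classification default
mtry_reg_demo <- max(floor(p_demo / 3), 1)  # regression default

# The documented "auto" rule for fold-level parallelism: k > 1 and nthreads >= 4.
auto_parallel_demo <- function(k, nthreads) k > 1 && nthreads >= 4
auto_parallel_demo(k = 5, nthreads = 8)
auto_parallel_demo(k = 5, nthreads = 2)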

}
\references{
Alexandre Godmer, Yahia Benzerara, Emmanuelle Varon, Nicolas Veziris, Karen Druart, Renaud Mozet, Mariette Matondo, Alexandra Aubry, Quentin Giai Gianetto, MSclassifR: An R package for supervised classification of mass spectra with machine learning methods, Expert Systems with Applications, Volume 294, 2025, 128796, ISSN 0957-4174, \doi{10.1016/j.eswa.2025.128796}.
}
\seealso{
\code{ranger::ranger}. A holdout-based (validation-fold) permutation
alternative would need a custom implementation that calls \code{predict()} on
permuted features. For a full feature-selection wrapper, see
\code{SelectionVar} with \code{MethodSelection = "cvp"}.
}
