% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/step_rose.R
\name{step_rose}
\alias{step_rose}
\title{Apply ROSE Algorithm}
\usage{
step_rose(
  recipe,
  ...,
  role = NA,
  trained = FALSE,
  column = NULL,
  over_ratio = 1,
  minority_prop = 0.5,
  minority_smoothness = 1,
  majority_smoothness = 1,
  skip = TRUE,
  seed = sample.int(10^5, 1),
  id = rand_id("rose")
)
}
\arguments{
\item{recipe}{A recipe object. The step will be added to the
sequence of operations for this recipe.}

\item{...}{One or more selector functions to choose which
variable is used to sample the data. See \code{\link[=selections]{selections()}}
for more details. The selection should result in a \emph{single
factor variable}. For the \code{tidy} method, these are not
currently used.}

\item{role}{Not used by this step since no new variables are
created.}

\item{trained}{A logical to indicate if the quantities for
preprocessing have been estimated.}

\item{column}{A character string of the variable name that will
be populated (eventually) by the \code{...} selectors.}

\item{over_ratio}{A numeric value for the ratio of the
minority-to-majority frequencies. The default value (1) means
that all other levels are sampled up to have the same
frequency as the most occurring level. A value of 0.5 would mean
that the minority levels will have, at most, approximately
half as many rows as the majority level.}

\item{minority_prop}{A numeric. Determines the amount of over-sampling of the
minority class. Defaults to 0.5.}

\item{minority_smoothness}{A numeric. Shrink factor to be multiplied by the
smoothing parameters to estimate the conditional kernel density of the
minority class. Defaults to 1.}

\item{majority_smoothness}{A numeric. Shrink factor to be multiplied by the
smoothing parameters to estimate the conditional kernel density of the
majority class. Defaults to 1.}

\item{skip}{A logical. Should the step be skipped when the
recipe is baked by \code{\link[recipes:bake]{bake()}}? While all operations are baked
when \code{\link[recipes:prep]{prep()}} is run, some operations may not be able to be
conducted on new data (e.g. processing the outcome variable(s)).
Care should be taken when using \code{skip = TRUE} as it may affect
the computations for subsequent operations.}

\item{seed}{An integer that will be used as the seed when
generating the ROSE sample.}

\item{id}{A character string that is unique to this step to identify it.}
}
\value{
An updated version of \code{recipe} with the new step
added to the sequence of existing steps (if any). For the
\code{tidy} method, a tibble with the column \code{terms}, which is
the variable used to sample.
}
\description{
\code{step_rose} creates a \emph{specification} of a recipe
step that generates a sample of synthetic data by enlarging the feature
space of minority and majority class examples, using \code{\link[ROSE:ROSE]{ROSE::ROSE()}}.
}
\details{
The factor variable used to balance around must have exactly 2 levels.

The ROSE algorithm works by selecting an observation belonging to class k
and generating new examples in its neighborhood, the width of which is
determined by a smoothing matrix H_k. Smaller values of
\code{minority_smoothness} and \code{majority_smoothness} shrink the
entries of the corresponding smoothing matrix H_k. Shrinking would be a
cautious choice if there is a concern that excessively large neighborhoods
could blur the boundaries between the regions of the feature space
associated with each class.

All columns in the data are sampled and returned by \code{\link[=juice]{juice()}}
and \code{\link[=bake]{bake()}}.

When used in modeling, users should strongly consider using the
option \code{skip = TRUE} so that the extra sampling is \emph{not}
conducted outside of the training set.
}
\section{Tidying}{
When you \code{\link[=tidy.recipe]{tidy()}} this step, a tibble with the column \code{terms}
(the selectors or variables selected) will be returned.
}

\section{Tuning Parameters}{
This step has 1 tuning parameter:
\itemize{
\item \code{over_ratio}: Over-Sampling Ratio (type: double, default: 1)
}
}

\section{Case weights}{


The underlying operation does not allow for case weights.
}

\examples{
library(recipes)
library(modeldata)
data(hpc_data)

hpc_data0 <- hpc_data \%>\%
  mutate(class = factor(class == "VF", labels = c("not VF", "VF"))) \%>\%
  select(-protocol, -day)

orig <- count(hpc_data0, class, name = "orig")
orig

up_rec <- recipe(class ~ ., data = hpc_data0) \%>\%
  step_rose(class) \%>\%
  prep()

training <- up_rec \%>\%
  bake(new_data = NULL) \%>\%
  count(class, name = "training")
training

# Since `skip` defaults to TRUE, baking the step has no effect
baked <- up_rec \%>\%
  bake(new_data = hpc_data0) \%>\%
  count(class, name = "baked")
baked

orig \%>\%
  left_join(training, by = "class") \%>\%
  left_join(baked, by = "class")
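
# A minimal sketch: `minority_prop` is assumed to control the share of
# minority-class rows in the resampled data, so a value below 0.5 should
# leave the minority class smaller than the majority class. The value 0.25
# is an arbitrary choice for illustration, and the counts vary with `seed`.
recipe(class ~ ., data = hpc_data0) \%>\%
  step_rose(class, minority_prop = 0.25) \%>\%
  prep() \%>\%
  bake(new_data = NULL) \%>\%
  count(class, name = "rose_25")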

library(ggplot2)

ggplot(circle_example, aes(x, y, color = class)) +
  geom_point() +
  labs(title = "Without ROSE")

recipe(class ~ x + y, data = circle_example) \%>\%
  step_rose(class) \%>\%
  prep() \%>\%
  bake(new_data = NULL) \%>\%
  ggplot(aes(x, y, color = class)) +
  geom_point() +
  labs(title = "With ROSE")
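
# A minimal sketch of the smoothness arguments: values below 1 shrink the
# smoothing matrices, so the synthetic points are expected to sit closer to
# the observed ones. The value 0.1 is an arbitrary choice for illustration.
recipe(class ~ x + y, data = circle_example) \%>\%
  step_rose(class, minority_smoothness = 0.1, majority_smoothness = 0.1) \%>\%
  prep() \%>\%
  bake(new_data = NULL) \%>\%
  ggplot(aes(x, y, color = class)) +
  geom_point() +
  labs(title = "With ROSE, smoothness = 0.1")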
}
\references{
Lunardon, N., Menardi, G., and Torelli, N. (2014). ROSE: a
Package for Binary Imbalanced Learning. R Journal, 6:82–92.

Menardi, G. and Torelli, N. (2014). Training and assessing
classification rules with imbalanced data. Data Mining and Knowledge
Discovery, 28:92–122.
}
\seealso{
Other Steps for over-sampling: 
\code{\link{step_adasyn}()},
\code{\link{step_bsmote}()},
\code{\link{step_smotenc}()},
\code{\link{step_smote}()},
\code{\link{step_upsample}()}
}
\concept{Steps for over-sampling}
