% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/fit.R
\name{fit-workflow}
\alias{fit-workflow}
\alias{fit.workflow}
\title{Fit a workflow object}
\usage{
\method{fit}{workflow}(object, data, ..., control = control_workflow())
}
\arguments{
\item{object}{A workflow}

\item{data}{A data frame of predictors and outcomes to use when fitting the
workflow}

\item{...}{Not used}

\item{control}{A \code{\link[=control_workflow]{control_workflow()}} object}
}
\value{
The workflow \code{object}, updated with a fitted parsnip model in the
\code{object$fit$fit} slot.
}
\description{
Fitting a workflow currently involves two main steps:
\itemize{
\item Preprocessing the data using a formula preprocessor, or by calling
\code{\link[recipes:prep]{recipes::prep()}} on a recipe.
\item Fitting the underlying parsnip model using \code{\link[parsnip:fit]{parsnip::fit.model_spec()}}.
}
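
The two steps above can be sketched by hand as follows. This is a rough
conceptual illustration only, not how \code{fit.workflow()} is implemented
internally. The object names (\code{rec}, \code{prepped}, \code{processed},
\code{lm_fit}) are hypothetical, and \code{bake(new_data = NULL)} assumes a
reasonably recent version of recipes:\if{html}{\out{<div class="sourceCode r">}}\preformatted{library(parsnip)
library(recipes)

# Step 1: estimate the preprocessing on the training data, then apply it
rec <- recipe(mpg ~ cyl + disp, data = mtcars) \%>\%
  step_log(disp)
prepped <- prep(rec, training = mtcars)
processed <- bake(prepped, new_data = NULL)

# Step 2: fit the parsnip model specification to the preprocessed data
lm_fit <- linear_reg() \%>\%
  set_engine("lm") \%>\%
  fit(mpg ~ ., data = processed)
}\if{html}{\out{</div>}}

A workflow bundles the preprocessor and the model specification so that a
single call to \code{fit()} carries out both steps.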
}
\details{
In the future, there will also be \emph{postprocessing} steps that can be added
after the model has been fit.
}
\section{Indicator Variable Details}{
Some modeling functions in R create indicator/dummy variables from
categorical data when you use a model formula, and some do not. When you
specify and fit a model with a \code{workflow()}, parsnip and workflows match
and reproduce the underlying behavior of the user-specified model’s
computational engine.
\subsection{Formula Preprocessor}{

In the \link[modeldata:Sacramento]{modeldata::Sacramento} data set of real
estate prices, the \code{type} variable has three levels: \code{"Residential"},
\code{"Condo"}, and \code{"Multi_Family"}. This base \code{workflow()} contains a
formula added via \code{\link[=add_formula]{add_formula()}} to predict property
price from property type, square footage, number of beds, and number of
baths:\if{html}{\out{<div class="sourceCode r">}}\preformatted{set.seed(123)

library(parsnip)
library(recipes)
library(workflows)
library(modeldata)

data("Sacramento")

base_wf <- workflow() \%>\%
  add_formula(price ~ type + sqft + beds + baths)
}\if{html}{\out{</div>}}

This first model does create dummy/indicator variables:\if{html}{\out{<div class="sourceCode r">}}\preformatted{lm_spec <- linear_reg() \%>\%
  set_engine("lm")

base_wf \%>\%
  add_model(lm_spec) \%>\%
  fit(Sacramento)
}\if{html}{\out{</div>}}\preformatted{## == Workflow [trained] ================================================
## Preprocessor: Formula
## Model: linear_reg()
## 
## -- Preprocessor ------------------------------------------------------
## price ~ type + sqft + beds + baths
## 
## -- Model -------------------------------------------------------------
## 
## Call:
## stats::lm(formula = ..y ~ ., data = data)
## 
## Coefficients:
##      (Intercept)  typeMulti_Family   typeResidential  
##          32919.4          -21995.8           33688.6  
##             sqft              beds             baths  
##            156.2          -29788.0            8730.0
}

There are \strong{five} independent variables in the fitted model for this
OLS linear regression. With this model type and engine, the factor
predictor \code{type} of the real estate properties was converted to two
binary predictors, \code{typeMulti_Family} and \code{typeResidential}. (The third
type, for condos, does not need its own column because it is the
baseline level.)

This second model does not create dummy/indicator variables:\if{html}{\out{<div class="sourceCode r">}}\preformatted{rf_spec <- rand_forest() \%>\%
  set_mode("regression") \%>\%
  set_engine("ranger")

base_wf \%>\%
  add_model(rf_spec) \%>\%
  fit(Sacramento)
}\if{html}{\out{</div>}}\preformatted{## == Workflow [trained] ================================================
## Preprocessor: Formula
## Model: rand_forest()
## 
## -- Preprocessor ------------------------------------------------------
## price ~ type + sqft + beds + baths
## 
## -- Model -------------------------------------------------------------
## Ranger result
## 
## Call:
##  ranger::ranger(x = maybe_data_frame(x), y = y, num.threads = 1,      verbose = FALSE, seed = sample.int(10^5, 1)) 
## 
## Type:                             Regression 
## Number of trees:                  500 
## Sample size:                      932 
## Number of independent variables:  4 
## Mtry:                             2 
## Target node size:                 5 
## Variable importance mode:         none 
## Splitrule:                        variance 
## OOB prediction error (MSE):       7058847504 
## R squared (OOB):                  0.5894647
}

Note that there are \strong{four} independent variables in the fitted model
for this ranger random forest. With this model type and engine,
indicator variables were not created for the \code{type} of real estate
property being sold. Tree-based models such as random forest models can
handle factor predictors directly, and don’t need any conversion to
numeric binary variables.
}

\subsection{Recipe Preprocessor}{

When you specify a model with a \code{workflow()} and a recipe preprocessor
via \code{\link[=add_recipe]{add_recipe()}}, the \emph{recipe} controls whether dummy
variables are created or not; the recipe overrides any underlying
behavior from the model’s computational engine.
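
As a minimal sketch of this behavior, the recipe below adds
\code{\link[recipes:step_dummy]{recipes::step_dummy()}} for the \code{type}
predictor and is then used with the ranger engine. The names \code{dummy_rec}
and \code{rf_spec} are illustrative, and the example assumes that the recipes,
parsnip, workflows, modeldata, and ranger packages are installed. With this
recipe, the model should receive indicator columns for \code{type} rather than
the original factor, even though the ranger engine would not create them on
its own:\if{html}{\out{<div class="sourceCode r">}}\preformatted{library(parsnip)
library(recipes)
library(workflows)
library(modeldata)

data("Sacramento")

# The recipe, not the engine, decides whether `type` becomes indicators
dummy_rec <- recipe(price ~ type + sqft + beds + baths, data = Sacramento) \%>\%
  step_dummy(type)

rf_spec <- rand_forest() \%>\%
  set_mode("regression") \%>\%
  set_engine("ranger")

workflow() \%>\%
  add_recipe(dummy_rec) \%>\%
  add_model(rf_spec) \%>\%
  fit(Sacramento)
}\if{html}{\out{</div>}}

Omitting \code{step_dummy()} from the recipe would instead pass \code{type}
to the engine as a factor.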
}
}

\examples{
library(parsnip)
library(recipes)
library(magrittr)

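# A parsnip model specification shared by both workflows
model <- linear_reg() \%>\%
  set_engine("lm")

base_wf <- workflow() \%>\%
  add_model(model)

# Formula preprocessor: the log transformation is written inline
formula_wf <- base_wf \%>\%
  add_formula(mpg ~ cyl + log(disp))

fit(formula_wf, mtcars)

# Recipe preprocessor: the same transformation expressed as a recipe step
recipe <- recipe(mpg ~ cyl + disp, mtcars) \%>\%
  step_log(disp)

recipe_wf <- base_wf \%>\%
  add_recipe(recipe)

fit(recipe_wf, mtcars)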
}
