This vignette describes general best practices for creating, configuring, and running drake
projects.
It is best to write your code as a set of functions. You can save those functions in R scripts and then source() them before doing anything else.
# Load functions get_data(), analyze_data(), and summarize_results()
source("my_functions.R")
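For illustration, my_functions.R might contain definitions along the following lines. The function bodies are assumptions made up for this sketch (the vignette does not show them), and they presume that data.csv has numeric columns named x and y.

```r
# Hypothetical contents of my_functions.R. The bodies below are illustrative
# assumptions, not code shipped with drake.

# Read the raw data from a CSV file.
get_data <- function(file) {
  read.csv(file, stringsAsFactors = FALSE)
}

# Fit a simple linear model (assumes numeric columns x and y).
analyze_data <- function(data) {
  lm(y ~ x, data = data)
}

# Combine the data and the fitted model into a small summary list.
summarize_results <- function(data, analysis) {
  list(
    n_rows = nrow(data),
    coefficients = coef(analysis)
  )
}
```

Because each step is an ordinary function, it can be tested on its own before it ever appears in a workflow plan.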
Then, set up your workflow plan data frame.
good_plan <- drake_plan(
my_data = get_data('data.csv'), # External files need to be in commands explicitly. # nolint
my_analysis = analyze_data(my_data),
my_summaries = summarize_results(my_data, my_analysis)
)
good_plan
##         target                                 command
## 1      my_data                    get_data('data.csv')
## 2  my_analysis                   analyze_data(my_data)
## 3 my_summaries summarize_results(my_data, my_analysis)
Drake knows that my_analysis depends on my_data because my_data is an argument to analyze_data(), which is part of the command for my_analysis.
config <- drake_config(good_plan)
vis_drake_graph(config)
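The idea behind this dependency detection can be imitated in a few lines of base R. The sketch below is only an illustration of the principle, not drake's actual implementation: it uses all.vars() to list the symbols in each command and keeps the ones that are also target names. The objects commands, targets, and deps are names invented for this example.

```r
# Commands from good_plan, stored as unevaluated expressions.
commands <- list(
  my_data      = quote(get_data("data.csv")),
  my_analysis  = quote(analyze_data(my_data)),
  my_summaries = quote(summarize_results(my_data, my_analysis))
)
targets <- names(commands)

# For each command, keep the variable names that are also targets.
# all.vars() skips function names such as analyze_data.
deps <- lapply(commands, function(cmd) intersect(all.vars(cmd), targets))

deps$my_data       # character(0)
deps$my_analysis   # "my_data"
deps$my_summaries  # "my_data" "my_analysis"
```

A command only reveals a dependency this way when the upstream target appears in it by name, which is why the plan passes my_data and my_analysis as explicit arguments.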
Now, you can call make() to build the targets.
make(good_plan)
If your commands are really long, just put them in larger functions. Drake analyzes imported functions for non-file dependencies.
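As a sketch of what that might look like, the several steps of a long command can be wrapped in one function, and the command then shrinks to a single call. The function fit_and_tidy() and the column names x and y are hypothetical and exist only for this example.

```r
# Hypothetical wrapper: several steps that would otherwise crowd one command.
fit_and_tidy <- function(data) {
  fit <- lm(y ~ x, data = data)   # assumes numeric columns x and y
  coefs <- coef(summary(fit))     # matrix of estimates and standard errors
  data.frame(
    term = rownames(coefs),
    estimate = coefs[, "Estimate"],
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# The plan command then stays short, for example:
# drake_plan(my_table = fit_and_tidy(my_data))
```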
Some people are accustomed to dividing their work into R scripts and then calling source() to run each step of the analysis. For example, you might have the following files.
get_data.R
analyze_data.R
summarize_results.R
If you migrate to drake, you may be tempted to set up a workflow plan like this.
bad_plan <- drake_plan(
my_data = source('get_data.R'), # nolint
my_analysis = source('analyze_data.R'), # nolint
my_summaries = source('summarize_results.R') # nolint
)
bad_plan
##         target                       command
## 1      my_data          source('get_data.R')
## 2  my_analysis      source('analyze_data.R')
## 3 my_summaries source('summarize_results.R')
But now, the dependency structure of your work is broken. Your R script files are dependencies, but since my_data is not mentioned in a function or command, drake does not know that my_analysis depends on it.
config <- drake_config(bad_plan)
vis_drake_graph(config)
Dangers:
1. With make(bad_plan, jobs = 2), drake will try to build my_data and my_analysis at the same time even though my_data must finish before my_analysis begins.
2. Drake is oblivious to data.csv since it is not explicitly mentioned in a workflow plan command. So when data.csv changes, make(bad_plan) will not rebuild my_data.
3. my_analysis will not update when my_data changes.
4. The return value of source() is formatted counter-intuitively. If source('get_data.R') is the command for my_data, then my_data will always be a list with elements "value" and "visible". In other words, source('get_data.R')$value is really what you would want.
In addition, this source()-based approach is simply inconvenient. Drake rebuilds my_data every time get_data.R changes, even when those changes are just extra comments or blank lines. On the other hand, in the previous plan that uses my_data = get_data(), drake does not trigger rebuilds when comments or whitespace in get_data() are modified. Drake is R-focused, not file-focused. If you embrace this viewpoint, your work will be easier.
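Two of these points can be checked with base R alone. The sketch below first shows the list that source() returns, then shows that a comment changes the text of a script without changing the parsed code. The second half only illustrates the file-versus-code distinction; it is not a description of how drake fingerprints functions internally. The temporary script and the variable names are made up for the example.

```r
# Write a throwaway script whose last expression is a data frame.
script <- tempfile(fileext = ".R")
writeLines(c("# get some data", "data.frame(x = 1:3)"), script)

result <- source(script)
names(result)                # "value" "visible"
is.data.frame(result)        # FALSE: the list wrapper is not the data
is.data.frame(result$value)  # TRUE: the data lives in $value

# The same function definition, with and without a comment.
with_comment <- "get_data <- function() {\n  # new comment\n  1:3\n}"
without_comment <- "get_data <- function() {\n  1:3\n}"

# As file contents, the two versions differ.
text_same <- identical(with_comment, without_comment)  # FALSE

# As parsed code (source references dropped), they are the same.
code_same <- identical(
  parse(text = with_comment, keep.source = FALSE),
  parse(text = without_comment, keep.source = FALSE)
)  # TRUE
```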
Drake makes special exceptions for R Markdown reports and other knitr reports such as *.Rmd and *.Rnw files. Not every drake project needs them, but it is good practice to use them to summarize the final results of a project once all the other targets have already been built. The basic example, for instance, has an R Markdown report. report.Rmd is knitted to build report.md, which summarizes the final results.
# Load all the functions and the workflow plan data frame, my_plan.
load_basic_example() # Get the code with drake_example("basic").
## cache /tmp/Rtmp7EbM7A/Rbuild75ca54d145b9/drake/vignettes/.drake
## connect 7 imports: tmp, simulate, reg1, my_plan, reg2, bad_plan, good_plan
## connect 15 targets: 'report.md', small, large, regression1_small, regression1...
To see where report.md will be built, look to the right of the workflow graph.
config <- drake_config(my_plan)
vis_drake_graph(config)
Drake treats knitr reports as a special case. Whenever drake sees knit() or render() (rmarkdown) mentioned in a command, it dives into the source file to look for dependencies. Consider report.Rmd from the basic example. When drake sees readd(small) in an active code chunk, it knows report.Rmd depends on the target called small, and it draws the appropriate arrow in the workflow graph above. And if small ever changes, make(my_plan) will re-process report.Rmd to produce the target file report.md.
knitr reports are the only kind of file that drake analyzes for dependencies. It does not give R scripts the same special treatment.
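To make the idea of scanning a report concrete, here is a rough base-R sketch that collects the targets named in readd() calls inside R code chunks. It is a simplification under stated assumptions and not drake's real parser: the helper find_readd_targets() is hypothetical, it relies on a regular expression, and it does not check chunk options, so it cannot tell active chunks from inactive ones.

```r
# Hypothetical helper: list targets passed to readd() inside R code chunks.
find_readd_targets <- function(rmd_lines) {
  in_chunk <- FALSE
  code <- character(0)
  for (line in rmd_lines) {
    if (!in_chunk && grepl("^`{3}\\{r", line)) {
      in_chunk <- TRUE                 # opening fence of an R chunk
    } else if (in_chunk && grepl("^`{3}[[:space:]]*$", line)) {
      in_chunk <- FALSE                # closing fence
    } else if (in_chunk) {
      code <- c(code, line)            # keep only lines inside chunks
    }
  }
  hits <- regmatches(code, gregexpr("readd\\([[:space:]]*[A-Za-z.][A-Za-z0-9._]*", code))
  unique(sub("readd\\([[:space:]]*", "", unlist(hits)))
}

# A tiny made-up report. The fence is built with strrep() to keep this
# example free of literal chunk delimiters.
fence <- strrep("`", 3)
report_lines <- c(
  "# Results",
  paste0(fence, "{r}"),
  "summary(readd(small))",
  fence,
  "Prose that mentions readd(ignored) outside any chunk."
)

found <- find_readd_targets(report_lines)
found  # "small"
```

The mention in the prose line is skipped because only lines between chunk delimiters are searched.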