When you run make(), drake stores your imports and output targets in a hidden cache.
library(drake)
load_basic_example(verbose = FALSE) # Get the code with drake_example("basic").
config <- make(my_plan, verbose = FALSE)
You can explore your cached data using functions like loadd(), readd(), and cached().
head(cached())
## cache /tmp/Rtmp7EbM7A/Rbuild75ca54d145b9/drake/vignettes/.drake
## [1] "'report.Rmd'" "'report.md'"
## [3] "coef_regression1_large" "coef_regression1_small"
## [5] "coef_regression2_large" "coef_regression2_small"
readd(small)
## cache /tmp/Rtmp7EbM7A/Rbuild75ca54d145b9/drake/vignettes/.drake
## x y
## 1 2.620 21.0
## 2 1.835 33.9
## 3 2.770 19.7
## 4 1.835 33.9
## 5 1.615 30.4
## 6 2.320 22.8
## 7 2.620 21.0
## 8 3.215 21.4
## 9 3.520 15.5
## 10 4.070 16.4
## 11 5.250 10.4
## 12 3.460 18.1
## 13 3.190 24.4
## 14 3.150 22.8
## 15 5.345 14.7
## 16 3.460 18.1
## 17 1.513 30.4
## 18 1.835 33.9
## 19 3.840 13.3
## 20 3.440 17.8
## 21 2.875 21.0
## 22 2.465 21.5
## 23 4.070 16.4
## 24 3.840 13.3
## 25 5.345 14.7
## 26 5.345 14.7
## 27 5.250 10.4
## 28 2.770 19.7
## 29 3.435 15.2
## 30 2.875 21.0
## 31 2.620 21.0
## 32 3.170 15.8
## 33 2.780 21.4
## 34 2.780 21.4
## 35 2.320 22.8
## 36 3.845 19.2
## 37 2.620 21.0
## 38 3.780 15.2
## 39 3.570 15.0
## 40 3.190 24.4
## 41 3.845 19.2
## 42 2.770 19.7
## 43 3.170 15.8
## 44 4.070 16.4
## 45 3.435 15.2
## 46 3.440 19.2
## 47 3.520 15.5
## 48 4.070 16.4
loadd(large)
## cache /tmp/Rtmp7EbM7A/Rbuild75ca54d145b9/drake/vignettes/.drake
head(large)
## x y
## 1 3.520 15.5
## 2 3.845 19.2
## 3 5.424 10.4
## 4 5.424 10.4
## 5 5.345 14.7
## 6 3.570 14.3
rm(large) # Does not remove `large` from the cache.
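Since rm() only removes large from your R session, you can restore it from the cache at any time. A quick sketch:
exists("large") # FALSE: `large` is gone from the environment...
loadd(large)    # ...but loadd() restores it from the cache.
exists("large") # TRUE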
The storr package does the heavy lifting. A storr is an object in R that serves as an abstraction for a storage backend, usually a file system. See the main storr vignette for a thorough walkthrough.
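To get a feel for the key-value interface storr provides on its own, independent of drake, here is a minimal sketch (the "scratch_storr" folder name is just a throwaway choice for this demo):
library(storr)
demo <- storr_rds("scratch_storr")    # A standalone storr, unrelated to drake.
demo$set("my_key", mtcars)            # Store any R object under a key.
identical(demo$get("my_key"), mtcars) # TRUE: retrieve it by the same key.
demo$destroy()                        # Delete the demo cache from disk.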
class(config$cache) # from `config <- make(...)`
## [1] "storr" "R6"
cache <- get_cache() # Get the default cache from the last build.
## cache /tmp/Rtmp7EbM7A/Rbuild75ca54d145b9/drake/vignettes/.drake
class(cache)
## [1] "storr" "R6"
cache$list() # Functionality from storr
## [1] "'report.Rmd'" "'report.md'"
## [3] "coef_regression1_large" "coef_regression1_small"
## [5] "coef_regression2_large" "coef_regression2_small"
## [7] "data.frame" "knit"
## [9] "large" "lm"
## [11] "mtcars" "nrow"
## [13] "reg1" "reg2"
## [15] "regression1_large" "regression1_small"
## [17] "regression2_large" "regression2_small"
## [19] "sample.int" "simulate"
## [21] "small" "summ_regression1_large"
## [23] "summ_regression1_small" "summ_regression2_large"
## [25] "summ_regression2_small" "summary"
## [27] "suppressWarnings"
cache$get("small") # Functionality from storr
## x y
## 1 2.620 21.0
## 2 1.835 33.9
## 3 2.770 19.7
## 4 1.835 33.9
## 5 1.615 30.4
## 6 2.320 22.8
## 7 2.620 21.0
## 8 3.215 21.4
## 9 3.520 15.5
## 10 4.070 16.4
## 11 5.250 10.4
## 12 3.460 18.1
## 13 3.190 24.4
## 14 3.150 22.8
## 15 5.345 14.7
## 16 3.460 18.1
## 17 1.513 30.4
## 18 1.835 33.9
## 19 3.840 13.3
## 20 3.440 17.8
## 21 2.875 21.0
## 22 2.465 21.5
## 23 4.070 16.4
## 24 3.840 13.3
## 25 5.345 14.7
## 26 5.345 14.7
## 27 5.250 10.4
## 28 2.770 19.7
## 29 3.435 15.2
## 30 2.875 21.0
## 31 2.620 21.0
## 32 3.170 15.8
## 33 2.780 21.4
## 34 2.780 21.4
## 35 2.320 22.8
## 36 3.845 19.2
## 37 2.620 21.0
## 38 3.780 15.2
## 39 3.570 15.0
## 40 3.190 24.4
## 41 3.845 19.2
## 42 2.770 19.7
## 43 3.170 15.8
## 44 4.070 16.4
## 45 3.435 15.2
## 46 3.440 19.2
## 47 3.520 15.5
## 48 4.070 16.4
The concept of hashing is central to storr's internals. Storr uses hashes to label stored objects, and drake leverages these hashes to figure out which targets are up to date and which ones are outdated. A hash is like a target's fingerprint, so the hash changes when the target changes. Regardless of the target's size, the hash is always the same number of characters.
library(digest) # package for hashing objects and files
smaller_data <- 12
larger_data <- rnorm(1000)
digest(smaller_data) # compute the hash
## [1] "23c80a31c0713176016e6e18d76a5f31"
digest(larger_data)
## [1] "67e461da531a9ef268989f2551934379"
However, hashes from different algorithms vary in length.
digest(larger_data, algo = "sha512")
## [1] "4ed2c12fee63fddb4a669960ed885ab899356e27a5e574b7062ddf29a6d9fe9f8f83e52d4b6e37271c6ee3ee4397e619a90a4b350211f00c03b2eebb31d1424b"
digest(larger_data, algo = "md5")
## [1] "67e461da531a9ef268989f2551934379"
digest(larger_data, algo = "xxhash64")
## [1] "34cd803ea2d90bf6"
digest(larger_data, algo = "murmur32")
## [1] "b4c76dc0"
Hashing is expensive, and unsurprisingly, shorter hashes are usually faster to compute. So why not always use murmur32? One reason is the risk of collisions: that is, when two different objects have the same hash. In general, shorter hashes have more frequent collisions. On the other hand, a longer hash is not always the answer. Besides the loss of speed, drake and storr sometimes use hash keys as file names, and long hashes could violate the 260-character cap on Windows file paths. That is why drake uses a shorter hash algorithm for internal cache-related file names and a longer hash algorithm for everything else.
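To make the collision risk concrete, the birthday-problem approximation p = 1 - exp(-n^2 / (2 * 2^b)) estimates the chance that at least two of n objects share a b-bit hash. A rough sketch (the collision_prob() helper is just for illustration):
collision_prob <- function(n, bits) {
  # Birthday-problem approximation for n items in a 2^bits hash space.
  1 - exp(-n^2 / (2 * 2^bits))
}
collision_prob(1e5, 32) # murmur32 (32-bit): a collision is likely by 100,000 objects.
collision_prob(1e5, 64) # xxhash64 (64-bit): still vanishingly unlikely.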
default_short_hash_algo()
## [1] "xxhash64"
default_long_hash_algo()
## [1] "sha256"
short_hash(cache)
## [1] "xxhash64"
long_hash(cache)
## [1] "sha256"
For new projects, use new_cache() to set the hash algorithms of the default cache.
cache_path(cache) # Default cache from before.
## [1] "/tmp/Rtmp7EbM7A/Rbuild75ca54d145b9/drake/vignettes/.drake"
# Start from scratch to reset both hash algorithms.
clean(destroy = TRUE)
## cache /tmp/Rtmp7EbM7A/Rbuild75ca54d145b9/drake/vignettes/.drake
tmp <- new_cache(
  path = default_cache_path(), # The `.drake/` folder.
  short_hash_algo = "crc32",
  long_hash_algo = "sha1"
)
## cache /tmp/Rtmp7EbM7A/Rbuild75ca54d145b9/drake/vignettes/.drake
The cache at default_cache_path() (equivalently, the .drake/ folder) is the default cache used for make().
config <- make(my_plan, verbose = FALSE)
short_hash(config$cache) # xxhash64 is the default_short_hash_algo()
## [1] "crc32"
long_hash(config$cache) # sha256 is the default_long_hash_algo()
## [1] "sha1"
You can change the long hash algorithm without throwing away the cache, but your project will rebuild from scratch. As for the short hash, you are committed until you delete the cache and all its supporting files.
outdated(config) # empty
## character(0)
config$cache <- configure_cache(
  config$cache,
  long_hash_algo = "murmur32",
  overwrite_hash_algos = TRUE
)
Below, the targets become outdated because the existing hash keys do not match the new hash algorithm.
config <- drake_config(my_plan, verbose = FALSE, cache = config$cache)
outdated(config)
## [1] "'report.md'" "coef_regression1_large"
## [3] "coef_regression1_small" "coef_regression2_large"
## [5] "coef_regression2_small" "large"
## [7] "regression1_large" "regression1_small"
## [9] "regression2_large" "regression2_small"
## [11] "small" "summ_regression1_large"
## [13] "summ_regression1_small" "summ_regression2_large"
## [15] "summ_regression2_small"
config <- make(my_plan, verbose = FALSE)
short_hash(config$cache) # same as before
## [1] "crc32"
long_hash(config$cache) # different from before
## [1] "murmur32"
You do not need to use the default cache at the default_cache_path() (.drake/). However, if you use a different file system, such as the custom faster_cache/ folder below, you will need to manually supply the cache to all functions that require one.
faster_cache <- new_cache(
  path = "faster_cache",
  short_hash_algo = "murmur32",
  long_hash_algo = "murmur32"
)
## cache /tmp/Rtmp7EbM7A/Rbuild75ca54d145b9/drake/vignettes/faster_cache
cache_path(faster_cache)
## [1] "/tmp/Rtmp7EbM7A/Rbuild75ca54d145b9/drake/vignettes/faster_cache"
cache_path(cache) # location of the previous cache
## [1] "/tmp/Rtmp7EbM7A/Rbuild75ca54d145b9/drake/vignettes/.drake"
short_hash(faster_cache)
## [1] "murmur32"
long_hash(faster_cache)
## [1] "murmur32"
new_plan <- drake_plan(
  simple = 1 + 1
)
make(new_plan, cache = faster_cache)
## connect 64 imports: additions, output_plan, data_plan, faster_cache, results,...
## connect 1 target: simple
## check 1 item: simple
## target simple
cached(cache = faster_cache)
## [1] "simple"
readd(simple, cache = faster_cache)
## [1] 2
You can recover an old cache from the file system. You could use storr::storr_rds() directly if you know the short hash algorithm, but this_cache() and recover_cache() are safer for drake. get_cache() is similar, but it has a slightly different interface.
old_cache <- this_cache("faster_cache") # Get a cache you know exists...
recovered <- recover_cache("faster_cache") # or create a new one if missing.
If you want to bypass drake and generate a cache directly from storr, it is best to do so right from the beginning.
library(storr)
my_storr <- storr_rds("my_storr", mangle_key = TRUE)
make(new_plan, cache = my_storr)
## Unloading targets from environment:
## simple
## connect 65 imports: my_storr, additions, output_plan, data_plan, faster_cache...
## connect 1 target: simple
## check 1 item: simple
## target simple
cached(cache = my_storr)
## [1] "simple"
readd(simple, cache = my_storr)
## [1] 2
In addition to storr_rds(), drake supports in-memory caches created from storr_environment(). However, parallel computing is not supported for these caches. The jobs argument must be 1, and the parallelism argument must be either "mclapply" or "parLapply". (It is sufficient to leave the default values alone.)
memory_cache <- storr_environment()
other_plan <- drake_plan(
  some_data = rnorm(50),
  more_data = rpois(75, lambda = 10),
  result = mean(c(some_data, more_data))
)
make(other_plan, cache = memory_cache)
## connect 68 imports: my_storr, additions, output_plan, data_plan, faster_cache...
## connect 3 targets: some_data, more_data, result
## check 4 items: c, mean, rnorm, rpois
## check 2 items: more_data, some_data
## target more_data
## target some_data
## check 1 item: result
## target result
cached(cache = memory_cache)
## [1] "c" "mean" "more_data" "result" "rnorm" "rpois"
## [7] "some_data"
readd(result, cache = memory_cache)
## [1] 5.975047
In theory, it should be possible to leverage serious databases using storr_dbi(). However, if you use such caches, do not use parallel computing: leave the parallelism and jobs arguments to make() as the defaults. This is because storr_dbi() caches keep their data in a small number of files internally, so several parallel processes could try to write to the same file at the same time. So far, only storr_rds() caches (the default) were designed for use with parallel computing.
The following example requires the DBI and RSQLite packages.
mydb <- DBI::dbConnect(RSQLite::SQLite(), "my-db.sqlite")
cache <- storr::storr_dbi(
  tbl_data = "data",
  tbl_keys = "keys",
  con = mydb
)
load_basic_example() # Get the code with drake_example("basic").
unlink(".drake", recursive = TRUE)
make(my_plan, cache = cache)
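When you are finished with a storr_dbi() cache, you may also want to close the connection and delete the database file. A small cleanup sketch:
DBI::dbDisconnect(mydb) # Close the database connection.
unlink("my-db.sqlite")  # Remove the SQLite file itself.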
If you want to start from scratch, you can clean() the cache. Use the destroy argument to remove it completely. cache$del() and cache$destroy() are also options, but they leave output file targets dangling. By contrast, clean(destroy = TRUE) removes file targets generated by drake::make(). drake_gc() and clean(..., garbage_collection = TRUE) do garbage collection, and clean(purge = TRUE) removes all target-level data, not just the final output values.
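For instance, here is a sketch of those heavier cleanup options (not run here, since they would erase data we still need below):
clean(purge = TRUE)              # Remove all target-level data, not just final values.
drake_gc()                       # Free space held by orphaned data.
clean(garbage_collection = TRUE) # Clean and collect garbage in one call.
Below, we remove just the small and large datasets from the cache.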
clean(small, large)
## cache /tmp/Rtmp7EbM7A/Rbuild75ca54d145b9/drake/vignettes/.drake
cached() # 'small' and 'large' are gone
## cache /tmp/Rtmp7EbM7A/Rbuild75ca54d145b9/drake/vignettes/.drake
## [1] "'report.Rmd'" "'report.md'"
## [3] "coef_regression1_large" "coef_regression1_small"
## [5] "coef_regression2_large" "coef_regression2_small"
## [7] "data.frame" "knit"
## [9] "lm" "mtcars"
## [11] "nrow" "reg1"
## [13] "reg2" "regression1_large"
## [15] "regression1_small" "regression2_large"
## [17] "regression2_small" "sample.int"
## [19] "simulate" "summ_regression1_large"
## [21] "summ_regression1_small" "summ_regression2_large"
## [23] "summ_regression2_small" "summary"
## [25] "suppressWarnings"
clean(destroy = TRUE)
## cache /tmp/Rtmp7EbM7A/Rbuild75ca54d145b9/drake/vignettes/.drake
clean(destroy = TRUE, cache = faster_cache)
clean(destroy = TRUE, cache = my_storr)