Storage basics

When you run make(), drake stores your imports and output targets in a hidden cache.

library(drake)
load_basic_example(verbose = FALSE) # Get the code with drake_example("basic").
config <- make(my_plan, verbose = FALSE)

You can explore your cached data using functions like loadd(), readd(), and cached().

head(cached())
## cache /tmp/Rtmp7EbM7A/Rbuild75ca54d145b9/drake/vignettes/.drake
## [1] "'report.Rmd'"           "'report.md'"           
## [3] "coef_regression1_large" "coef_regression1_small"
## [5] "coef_regression2_large" "coef_regression2_small"

readd(small)
## cache /tmp/Rtmp7EbM7A/Rbuild75ca54d145b9/drake/vignettes/.drake
##        x    y
## 1  2.620 21.0
## 2  1.835 33.9
## 3  2.770 19.7
## 4  1.835 33.9
## 5  1.615 30.4
## 6  2.320 22.8
## 7  2.620 21.0
## 8  3.215 21.4
## 9  3.520 15.5
## 10 4.070 16.4
## 11 5.250 10.4
## 12 3.460 18.1
## 13 3.190 24.4
## 14 3.150 22.8
## 15 5.345 14.7
## 16 3.460 18.1
## 17 1.513 30.4
## 18 1.835 33.9
## 19 3.840 13.3
## 20 3.440 17.8
## 21 2.875 21.0
## 22 2.465 21.5
## 23 4.070 16.4
## 24 3.840 13.3
## 25 5.345 14.7
## 26 5.345 14.7
## 27 5.250 10.4
## 28 2.770 19.7
## 29 3.435 15.2
## 30 2.875 21.0
## 31 2.620 21.0
## 32 3.170 15.8
## 33 2.780 21.4
## 34 2.780 21.4
## 35 2.320 22.8
## 36 3.845 19.2
## 37 2.620 21.0
## 38 3.780 15.2
## 39 3.570 15.0
## 40 3.190 24.4
## 41 3.845 19.2
## 42 2.770 19.7
## 43 3.170 15.8
## 44 4.070 16.4
## 45 3.435 15.2
## 46 3.440 19.2
## 47 3.520 15.5
## 48 4.070 16.4

loadd(large)
## cache /tmp/Rtmp7EbM7A/Rbuild75ca54d145b9/drake/vignettes/.drake

head(large)
##       x    y
## 1 3.520 15.5
## 2 3.845 19.2
## 3 5.424 10.4
## 4 5.424 10.4
## 5 5.345 14.7
## 6 3.570 14.3

rm(large) # Does not remove `large` from the cache.
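Because `rm()` only touches the R session, the cached copy survives, and the same `loadd()`/`readd()` functions can restore it on demand. A quick sketch, assuming the basic example above has already been built:

```r
# `rm(large)` removed `large` from the environment, not from the cache,
# so drake can bring it back at any time.
loadd(large)      # reload `large` from the cache into the environment
exists("large")   # TRUE again
```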

Caches as R objects

The storr package does the heavy lifting. A storr is an object in R that serves as an abstraction for a storage backend, usually a file system. See the main storr vignette for a thorough walkthrough.

class(config$cache) # from `config <- make(...)`
## [1] "storr" "R6"

cache <- get_cache() # Get the default cache from the last build.
## cache /tmp/Rtmp7EbM7A/Rbuild75ca54d145b9/drake/vignettes/.drake

class(cache)
## [1] "storr" "R6"

cache$list() # Functionality from storr
##  [1] "'report.Rmd'"           "'report.md'"           
##  [3] "coef_regression1_large" "coef_regression1_small"
##  [5] "coef_regression2_large" "coef_regression2_small"
##  [7] "data.frame"             "knit"                  
##  [9] "large"                  "lm"                    
## [11] "mtcars"                 "nrow"                  
## [13] "reg1"                   "reg2"                  
## [15] "regression1_large"      "regression1_small"     
## [17] "regression2_large"      "regression2_small"     
## [19] "sample.int"             "simulate"              
## [21] "small"                  "summ_regression1_large"
## [23] "summ_regression1_small" "summ_regression2_large"
## [25] "summ_regression2_small" "summary"               
## [27] "suppressWarnings"

cache$get("small") # Functionality from storr
##        x    y
## 1  2.620 21.0
## 2  1.835 33.9
## 3  2.770 19.7
## 4  1.835 33.9
## 5  1.615 30.4
## 6  2.320 22.8
## 7  2.620 21.0
## 8  3.215 21.4
## 9  3.520 15.5
## 10 4.070 16.4
## 11 5.250 10.4
## 12 3.460 18.1
## 13 3.190 24.4
## 14 3.150 22.8
## 15 5.345 14.7
## 16 3.460 18.1
## 17 1.513 30.4
## 18 1.835 33.9
## 19 3.840 13.3
## 20 3.440 17.8
## 21 2.875 21.0
## 22 2.465 21.5
## 23 4.070 16.4
## 24 3.840 13.3
## 25 5.345 14.7
## 26 5.345 14.7
## 27 5.250 10.4
## 28 2.770 19.7
## 29 3.435 15.2
## 30 2.875 21.0
## 31 2.620 21.0
## 32 3.170 15.8
## 33 2.780 21.4
## 34 2.780 21.4
## 35 2.320 22.8
## 36 3.845 19.2
## 37 2.620 21.0
## 38 3.780 15.2
## 39 3.570 15.0
## 40 3.190 24.4
## 41 3.845 19.2
## 42 2.770 19.7
## 43 3.170 15.8
## 44 4.070 16.4
## 45 3.435 15.2
## 46 3.440 19.2
## 47 3.520 15.5
## 48 4.070 16.4

Hash algorithms

The concept of hashing is central to storr's internals. Storr uses hashes to label stored objects, and drake leverages these hashes to figure out which targets are up to date and which ones are outdated. A hash is like a target's fingerprint, so the hash changes when the target changes. Regardless of the target's size, the hash is always the same number of characters.
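A quick sketch of the fingerprint analogy, using the same digest package loaded below: modifying an object changes its hash, while the hash length stays fixed.

```r
library(digest)  # hashing package, also used below

x <- c(1, 2, 3)
h1 <- digest(x)
x[1] <- 100              # change the data
h2 <- digest(x)

h1 == h2                 # FALSE: the fingerprint changed with the data
nchar(h1) == nchar(h2)   # TRUE: hash length does not depend on content
```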

library(digest) # package for hashing objects and files
smaller_data <- 12
larger_data <- rnorm(1000)

digest(smaller_data) # compute the hash
## [1] "23c80a31c0713176016e6e18d76a5f31"

digest(larger_data)
## [1] "67e461da531a9ef268989f2551934379"

However, different hash algorithms produce hashes of different lengths.

digest(larger_data, algo = "sha512")
## [1] "4ed2c12fee63fddb4a669960ed885ab899356e27a5e574b7062ddf29a6d9fe9f8f83e52d4b6e37271c6ee3ee4397e619a90a4b350211f00c03b2eebb31d1424b"

digest(larger_data, algo = "md5")
## [1] "67e461da531a9ef268989f2551934379"

digest(larger_data, algo = "xxhash64")
## [1] "34cd803ea2d90bf6"

digest(larger_data, algo = "murmur32")
## [1] "b4c76dc0"

Which hash algorithm should you choose?

Hashing is expensive, and unsurprisingly, shorter hashes are usually faster to compute. So why not always use murmur32? One reason is the risk of collisions: that is, when two different objects have the same hash. In general, shorter hashes have more frequent collisions. On the other hand, a longer hash is not always the answer. Besides the loss of speed, drake and storr sometimes use hash keys as file names, and long hashes could violate the 260-character cap on Windows file paths. That is why drake uses a shorter hash algorithm for internal cache-related file names and a longer hash algorithm for everything else.
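To make the collision risk concrete, here is a rough back-of-the-envelope sketch in base R using the standard birthday-bound approximation. The `collision_prob()` helper is ours for illustration, not part of drake or storr.

```r
# Approximate probability of at least one collision when hashing k distinct
# objects with an n-bit hash (birthday bound): p ~ 1 - exp(-k^2 / 2^(n + 1)).
collision_prob <- function(k, bits) {
  1 - exp(-k^2 / 2^(bits + 1))
}

collision_prob(1e4, 32)   # 32-bit hash (e.g. murmur32), 10,000 objects: about 1%
collision_prob(1e4, 64)   # 64-bit hash (e.g. xxhash64): negligible
collision_prob(1e4, 128)  # 128-bit hash (e.g. md5): effectively zero
```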

default_short_hash_algo()
## [1] "xxhash64"

default_long_hash_algo()
## [1] "sha256"

short_hash(cache)
## [1] "xxhash64"

long_hash(cache)
## [1] "sha256"

Select the hash algorithms of the default cache

For new projects, use new_cache() to set the hash algorithms of the default cache.

cache_path(cache) # Default cache from before.
## [1] "/tmp/Rtmp7EbM7A/Rbuild75ca54d145b9/drake/vignettes/.drake"

# Start from scratch to reset both hash algorithms.
clean(destroy = TRUE)
## cache /tmp/Rtmp7EbM7A/Rbuild75ca54d145b9/drake/vignettes/.drake

tmp <- new_cache(
  path = default_cache_path(), # The `.drake/` folder.
  short_hash_algo = "crc32",
  long_hash_algo = "sha1"
)
## cache /tmp/Rtmp7EbM7A/Rbuild75ca54d145b9/drake/vignettes/.drake

The cache at default_cache_path() (equivalently, the .drake/ folder) is the default cache used for make().

config <- make(my_plan, verbose = FALSE)

short_hash(config$cache) # xxhash64 is the default_short_hash_algo()
## [1] "crc32"

long_hash(config$cache) # sha256 is the default_long_hash_algo()
## [1] "sha1"

You can change the long hash algorithm without throwing away the cache, but your project will rebuild from scratch. As for the short hash, you are committed until you delete the cache and all its supporting files.

outdated(config) # empty
## character(0)

config$cache <- configure_cache(
  config$cache,
  long_hash_algo = "murmur32",
  overwrite_hash_algos = TRUE
)

Below, the targets become outdated because the existing hash keys do not match the new hash algorithm.

config <- drake_config(my_plan, verbose = FALSE, cache = config$cache)
outdated(config)
##  [1] "'report.md'"            "coef_regression1_large"
##  [3] "coef_regression1_small" "coef_regression2_large"
##  [5] "coef_regression2_small" "large"                 
##  [7] "regression1_large"      "regression1_small"     
##  [9] "regression2_large"      "regression2_small"     
## [11] "small"                  "summ_regression1_large"
## [13] "summ_regression1_small" "summ_regression2_large"
## [15] "summ_regression2_small"

config <- make(my_plan, verbose = FALSE)

short_hash(config$cache) # same as before
## [1] "crc32"

long_hash(config$cache) # different from before
## [1] "murmur32"

More on custom caches

You do not need to use the default cache at the default_cache_path() (.drake/). However, if you store your cache somewhere else, such as the custom faster_cache/ folder below, you will need to supply the cache manually to every function that requires one.

faster_cache <- new_cache(
  path = "faster_cache",
  short_hash_algo = "murmur32",
  long_hash_algo = "murmur32"
)
## cache /tmp/Rtmp7EbM7A/Rbuild75ca54d145b9/drake/vignettes/faster_cache

cache_path(faster_cache)
## [1] "/tmp/Rtmp7EbM7A/Rbuild75ca54d145b9/drake/vignettes/faster_cache"

cache_path(cache) # location of the previous cache
## [1] "/tmp/Rtmp7EbM7A/Rbuild75ca54d145b9/drake/vignettes/.drake"

short_hash(faster_cache)
## [1] "murmur32"

long_hash(faster_cache)
## [1] "murmur32"

new_plan <- drake_plan(
  simple = 1 + 1
)

make(new_plan, cache = faster_cache)
## connect 64 imports: additions, output_plan, data_plan, faster_cache, results,...
## connect 1 target: simple
## check 1 item: simple
## target simple

cached(cache = faster_cache)
## [1] "simple"

readd(simple, cache = faster_cache)
## [1] 2

Recovering the cache

You can recover an old cache from the file system. You could use storr::storr_rds() directly if you know the short hash algorithm, but this_cache() and recover_cache() are safer for drake. get_cache() is similar, but it has a slightly different interface.

old_cache <- this_cache("faster_cache") # Get a cache you know exists...
recovered <- recover_cache("faster_cache") # or create a new one if missing.

Custom storr caches

If you want to bypass drake and generate a cache directly with storr, it is best to do so right from the beginning.

library(storr)
my_storr <- storr_rds("my_storr", mangle_key = TRUE)
make(new_plan, cache = my_storr)
## Unloading targets from environment:
##   simple
## connect 65 imports: my_storr, additions, output_plan, data_plan, faster_cache...
## connect 1 target: simple
## check 1 item: simple
## target simple

cached(cache = my_storr)
## [1] "simple"

readd(simple, cache = my_storr)
## [1] 2

In addition to storr_rds(), drake supports in-memory caches created with storr_environment(). However, parallel computing is not supported for these caches. The jobs argument must be 1, and the parallelism argument must be either "mclapply" or "parLapply". (It is sufficient to leave the default values alone.)

memory_cache <- storr_environment()
other_plan <- drake_plan(
  some_data = rnorm(50),
  more_data = rpois(75, lambda = 10),
  result = mean(c(some_data, more_data))
)

make(other_plan, cache = memory_cache)
## connect 68 imports: my_storr, additions, output_plan, data_plan, faster_cache...
## connect 3 targets: some_data, more_data, result
## check 4 items: c, mean, rnorm, rpois
## check 2 items: more_data, some_data
## target more_data
## target some_data
## check 1 item: result
## target result

cached(cache = memory_cache)
## [1] "c"         "mean"      "more_data" "result"    "rnorm"     "rpois"    
## [7] "some_data"

readd(result, cache = memory_cache)
## [1] 5.975047

In theory, it should be possible to use full-fledged database backends via storr_dbi(). However, if you use such caches, please heed the following.

  1. Be sure you have storr version 1.1.3 or greater installed.
  2. Do not use parallel computing. In other words, leave the parallelism and jobs arguments to make() as the defaults. This is because storr_dbi() caches store their data in a small number of files, so several parallel processes could try to write to the same file at the same time. So far, only storr_rds() caches (the default) have been designed for use with parallel computing.

The following example requires the DBI and RSQLite packages.

mydb <- DBI::dbConnect(RSQLite::SQLite(), "my-db.sqlite")
cache <- storr::storr_dbi(
  tbl_data = "data",
  tbl_keys = "keys",
  con = mydb
)
load_basic_example() # Get the code with drake_example("basic").
unlink(".drake", recursive = TRUE)
make(my_plan, cache = cache)

Cleaning up

If you want to start from scratch, you can clean() the cache. Use the destroy argument to remove it completely. cache$del() and cache$destroy() are also options, but they leave output file targets dangling. By contrast, clean(destroy = TRUE) removes file targets generated by drake::make(). drake_gc() and clean(..., garbage_collection = TRUE) do garbage collection, and clean(purge = TRUE) removes all target-level data, not just the final output values.
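The less common cleanup calls mentioned above look like the following sketch. It is not run here, since purging and garbage collection permanently remove data from the default cache.

```r
# Not run: destructive cleanup operations on the default cache.
clean(purge = TRUE)               # remove all target-level data, not just values
drake_gc()                        # garbage-collect unused hash keys
clean(garbage_collection = TRUE)  # clean and garbage-collect in one step
```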

clean(small, large)
## cache /tmp/Rtmp7EbM7A/Rbuild75ca54d145b9/drake/vignettes/.drake

cached() # 'small' and 'large' are gone
## cache /tmp/Rtmp7EbM7A/Rbuild75ca54d145b9/drake/vignettes/.drake
##  [1] "'report.Rmd'"           "'report.md'"           
##  [3] "coef_regression1_large" "coef_regression1_small"
##  [5] "coef_regression2_large" "coef_regression2_small"
##  [7] "data.frame"             "knit"                  
##  [9] "lm"                     "mtcars"                
## [11] "nrow"                   "reg1"                  
## [13] "reg2"                   "regression1_large"     
## [15] "regression1_small"      "regression2_large"     
## [17] "regression2_small"      "sample.int"            
## [19] "simulate"               "summ_regression1_large"
## [21] "summ_regression1_small" "summ_regression2_large"
## [23] "summ_regression2_small" "summary"               
## [25] "suppressWarnings"

clean(destroy = TRUE)
## cache /tmp/Rtmp7EbM7A/Rbuild75ca54d145b9/drake/vignettes/.drake

clean(destroy = TRUE, cache = faster_cache)
clean(destroy = TRUE, cache = my_storr)