qs2: a framework for efficient serialization
qs2 is the successor to the qs package. The goal is to have reliable and fast performance for saving and loading objects in R.

The qs2 format directly uses R serialization (via the R_Serialize/R_Unserialize C API) while improving underlying compression and disk IO patterns. If you are familiar with the qs package, the benefits and usage are the same.
qs_save(data, "myfile.qs2")
data <- qs_read("myfile.qs2")
Use the file extension qs2 to distinguish it from the original qs package. It is not compatible with the original qs format.
install.packages("qs2")
On x64 Mac or Linux, you can enable multi-threading by compiling from source. It is enabled by default on Windows.
remotes::install_cran("qs2", type = "source", configure.args = "--with-TBB --with-simd=AVX2")
On non-x64 systems (e.g. Mac ARM) remove the AVX2 flag.
remotes::install_cran("qs2", type = "source", configure.args = "--with-TBB")
Multi-threading in qs2 uses the Intel Threading Building Blocks (TBB) framework via the RcppParallel package.
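As a minimal sketch of requesting threads at call time: this assumes qs_save and qs_read accept an nthreads argument (the compiled-code interface shown later in this document exposes the same parameter). The thread count of 4 and the single-thread fallback are illustrative choices, not part of the package documentation.

```r
library(qs2)

x <- runif(1e6)
file_mt <- tempfile(fileext = ".qs2")
nthreads <- 4  # illustrative value

# Assumption: a build without TBB may reject nthreads > 1,
# so fall back to a single thread if the call fails.
tryCatch(
  qs_save(x, file_mt, nthreads = nthreads),
  error = function(e) qs_save(x, file_mt, nthreads = 1)
)
y <- tryCatch(
  qs_read(file_mt, nthreads = nthreads),
  error = function(e) qs_read(file_mt, nthreads = 1)
)

# the round trip should reproduce the original vector
stopifnot(identical(x, y))
```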
Because the qs2 format directly uses R serialization, you can convert it to RDS and vice versa.
file_qs2 <- tempfile(fileext = ".qs2")
file_rds <- tempfile(fileext = ".RDS")
x <- runif(1e6)

# save `x` with qs_save
qs_save(x, file_qs2)
# convert the file to RDS
qs_to_rds(input_file = file_qs2, output_file = file_rds)
# read `x` back in with `readRDS`
xrds <- readRDS(file_rds)
stopifnot(identical(x, xrds))
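The reverse direction can be sketched the same way. This assumes the companion function rds_to_qs takes the same input_file and output_file arguments as qs_to_rds; the variable names are illustrative.

```r
library(qs2)

file_rds <- tempfile(fileext = ".RDS")
file_qs2 <- tempfile(fileext = ".qs2")
x <- runif(1e6)

# save `x` with saveRDS
saveRDS(x, file_rds)

# convert the RDS file to qs2 (assumed signature, mirroring qs_to_rds)
rds_to_qs(input_file = file_rds, output_file = file_qs2)

# read `x` back in with qs_read
xqs2 <- qs_read(file_qs2)
stopifnot(identical(x, xqs2))
```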
The qs2 format saves an internal checksum. This can be used to test for file corruption before deserialization via the validate_checksum parameter, but has a minor performance penalty.
qs_save(data, "myfile.qs2")
data <- qs_read("myfile.qs2", validate_checksum = TRUE)
The package also introduces the qdata format, which has its own serialization layout and works only with data types (vectors, lists, data frames, matrices).

It will replace internal types (functions, promises, external pointers, environments, objects) with NULL. The qdata format differs from the qs2 format in that it is NOT a general-purpose serialization format.
The eventual goal of qdata is to also have interoperability with other languages, particularly Python.
qd_save(data, "myfile.qs2")
data <- qd_read("myfile.qs2")
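To illustrate the replacement of non-data types, here is a small sketch that saves a list containing a function. It follows the description above and has not been checked against a specific qs2 version. The object and file names are made up, and suppressWarnings() is used on the assumption that qd_save may warn about unsupported types.

```r
library(qs2)

# a list mixing plain data with a function, which qdata does not store
obj <- list(values = 1:3, label = "a", f = function(z) z + 1)
file_qd <- tempfile(fileext = ".qs2")

# assumption: qd_save may emit a warning for the unsupported element
suppressWarnings(qd_save(obj, file_qd))
res <- qd_read(file_qd)

# the data elements are expected to round-trip unchanged
stopifnot(identical(res$values, obj$values))
stopifnot(identical(res$label, obj$label))

# per the description above, the function is expected to come back as NULL
stopifnot(is.null(res$f))
```

If functions or environments need to survive the round trip, qs_save and qs_read are the appropriate pair, since the qs2 format uses R serialization directly.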
A summary across 4 datasets is presented below.
| Algorithm | Compression | Save Time (s) | Read Time (s) |
|---|---|---|---|
| qs2 | 7.96 | 13.4 | 50.4 |
| qdata | 8.45 | 10.5 | 34.8 |
| base::serialize | 1.1 | 8.87 | 51.4 |
| saveRDS | 8.68 | 107 | 63.7 |
| fst | 2.59 | 5.09 | 46.3 |
| parquet | 8.29 | 20.3 | 38.4 |
| qs (legacy) | 7.97 | 9.13 | 48.1 |

| Algorithm | Compression | Save Time (s) | Read Time (s) |
|---|---|---|---|
| qs2 | 7.96 | 3.79 | 48.1 |
| qdata | 8.45 | 1.98 | 33.1 |
| fst | 2.59 | 5.05 | 46.6 |
| parquet | 8.29 | 20.2 | 37.0 |
| qs (legacy) | 7.97 | 3.21 | 52.0 |

- qs2, qdata and qs with compress_level = 3
- parquet via the arrow package using zstd compression_level = 3
- base::serialize with ascii = FALSE and xdr = FALSE

Datasets used
- 1000 genomes non-coding VCF: 1000 genomes non-coding variants (2743 MB)
- B-cell data: B-cell mouse data, Greiff 2017 (1057 MB)
- IP location: IPv4 range data with location information (198 MB)
- Netflix movie ratings: Netflix ML prediction dataset (571 MB)

These datasets are openly licensed and represent a combination of numeric and text data across multiple domains. See inst/analysis/datasets.R on GitHub.
Serialization functions can be accessed in compiled code. Below is an example using Rcpp.
// [[Rcpp::depends(qs2)]]
#include <Rcpp.h>
#include "qs2_external.h"
using namespace Rcpp;
// [[Rcpp::export]]
SEXP test_qs_serialize(SEXP x) {
  size_t len = 0;
  unsigned char * buffer = c_qs_serialize(x, &len, 10, true, 4); // object, buffer length, compress_level, shuffle, nthreads
  SEXP y = c_qs_deserialize(buffer, len, false, 4); // buffer, buffer length, validate_checksum, nthreads
  c_qs_free(buffer); // must manually free buffer
  return y;
}
// [[Rcpp::export]]
SEXP test_qd_serialize(SEXP x) {
  size_t len = 0;
  unsigned char * buffer = c_qd_serialize(x, &len, 10, true, 4); // object, buffer length, compress_level, shuffle, nthreads
  SEXP y = c_qd_deserialize(buffer, len, false, false, 4); // buffer, buffer length, use_alt_rep, validate_checksum, nthreads
  c_qd_free(buffer); // must manually free buffer
  return y;
}
/*** R
x <- runif(1e7)
stopifnot(test_qs_serialize(x) == x)
stopifnot(test_qd_serialize(x) == x)
*/