Find non-linear formulas that fits your input data. You can systematically explore and memoize the possible formulas and it’s cross-validation performance, in a parallel and incremental fashon. Three interoperable search functions are available:
random.search performs a random exploration,genetic.search employs a genetic optimization
algorithmexaustive.search explore all the space not already
explored (WARNING may take a long time and doesn’t have intermediate
save points.)The library is designed with a massive parallelization use in mind,
that’s why you have a base.filepath for shared results
among all parallel processors, and a res.filepath with
local results only.
Periodically, the multiple res.filepath will be joined
together in a shared base.filepath.
Let’s broke this down step by step.
l1.filepath <- paste0('regression/regression',type,'.exploration.l1.rData')
# create file
saveRDS(empty.sample(), l1.filepath)
K=15
N=30
max.formula.len=2
Note that you can control the formula space, and the algorithm employed by just setting three variables
Thus, you can run multiple independent search processes both on your computer or server-side with a command similar to this one
R --no-echo --no-restore --file=find_f.R --args 3 2 1006
That will run genetic search to find formulas of max len 3, and order up to 2.
You just need to get the variables from command-line
options <- commandArgs(trailingOnly = TRUE)
formula.len <- as.integer(options[1])
n.squares <- as.integer(options[2])
seed <- as.integer(options[3])
otherwise, you can hard-code them
best.n.squares <- n.squares <- 1
best.formula.len <- formula.len <- 2
seed=1010
Used only in genetic and random search. Exaustive search will append
the missing formulas to the file pointed by
base.filepath.
if(seed > 0){
res.filepath <- file.path(out.folder, paste0('regression',type,'.exploration.fl.',formula.len,'.ord.',n.squares,'seed.',seed,'.rData'))
}
rdf is the full datasetx is the variable to which we should apply the
non-linearityz is a list with min and
absmin fieldstransformations <- list(
"log"=function(rdf, x, z){
log10(x)
},
"log10"=function(rdf, x, z){
log10(0.1+abs(z$min)+x)
},
"log_fwhm_fcrn_p"=function(rdf, x, z){
pos_x1 <- abs(min(rdf$x1, na.rm = T)) + rdf$x1
log10(0.1 + abs(z$min) + x + pos_x1)
},
"my_log10"=function(rdf, x, z){
# WHATEVER fancy function you like
mask <- rdf$x1 > 0.1
mask[mask == 0] <- NA
x.nyte <- mask * x
l <- log10(0.1+abs(z$min)+x.nyte)
l[is.na(l)] <- 0
return(l)
},
"inv"=function(rdf, x, z){
1/(0.1+abs(z$min)+x)
},
"sigmoid"=function(rdf, x,z){
1 / ( 1 + exp(-x))
},
"invsigmoid"=function(rdf, x,z){
1/(1 + exp(x))
}
)
Based on seed value, run a different algorithm
if(seed == 0){
base.filepath <- paste0('regression/regression',type,'.exploration.l',formula.len,'.rData')
res.new <- exaustive.search(regressors.df, l.Fn,
n.squares=n.squares,
formula.len=formula.len,
K=K, N=N, seed=seed,
transformations=transformations,
custom.abs.mins=list(),
glob.filepath = base.filepath,
chunk.size=NULL, cv.norm=T)
}else if(seed < 1000){
base.filepath <- paste0('regression/regression',type,'.exploration.l',formula.len,'.rData')
# general "random" search to find good candidates
new.sample.res <- random.search(regressors.df, l.Fn, n.squares, formula.len,
maxiter=10, K=K, N=N,
transformations=transformations,
glob.filepath = base.filepath,
local.filepath = res.filepath,
memoization=T, cv.norm = T)
}else{
# genetic can change formula length
base.filepath <- c(
l1.filepath,
l2.filepath,
l3.filepath#,
#l4.filepath
)
# finetuning procedure using genetic algorithm
best.vars.l <- list(
c('log.x1') )
best.finetuned <- genetic.search(
regressors.df,
l.Fn,
n.squares=n.squares,
max.formula.len=max.formula.len,
maxiter=100,
transformations=transformations,
glob.filepath=base.filepath,
local.filepath=res.filepath,
memoization=T,
pcrossover=0.1,
pmutation=0.9,
seed=NULL,
keepBest=T,
K=K,
N=N,
popSize = 100,
best.vars.l=best.vars.l,
cv.norm = T
)
}