% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/CDF.R
\name{CDF}
\alias{CDF}
\title{The Comparison Data Forest (CDF) Approach}
\usage{
CDF(
  response,
  num.trees = 500,
  mtry = "sqrt",
  nfact.max = 10,
  N.pop = 10000,
  N.Samples = 500,
  cor.type = "pearson",
  use = "pairwise.complete.obs",
  vis = TRUE,
  plot = TRUE
)
}
\arguments{
\item{response}{A required \code{N} × \code{I} matrix or data.frame consisting of the responses of \code{N} individuals
to \code{I} items.}

\item{num.trees}{The number of trees in the Random Forest. (default = 500) See details.}

\item{mtry}{The number of features randomly sampled as split candidates at each node of a tree; can be a number or a character string (\code{"sqrt"}).
When \code{mtry = "sqrt"}, this number is set to the square root of
the number of available features (converted to an integer by \code{\link[base]{round}}).
(default = \code{"sqrt"}) See details.}

\item{nfact.max}{The maximum number of factors considered by the CDF approach. (default = 10)}

\item{N.pop}{Size of each simulated finite population. (default = 10,000)}

\item{N.Samples}{Number of samples drawn from each population. (default = 500)
Each sample is an \code{N} × \code{I} matrix, matching the dimensions of the empirical data.}

\item{cor.type}{A character string indicating which correlation coefficient (or covariance) is
to be computed. One of \code{"pearson"} (default), \code{"kendall"}, or
\code{"spearman"}. See \code{\link[stats]{cor}}.}

\item{use}{An optional character string giving a method for computing covariances in the presence of missing values. This
must be one of the strings \code{"everything"}, \code{"all.obs"}, \code{"complete.obs"}, \code{"na.or.complete"},
or \code{"pairwise.complete.obs"} (default). See \code{\link[stats]{cor}}.}

\item{vis}{A logical value. If \code{TRUE}, the factor retention results are printed; if \code{FALSE}, they are
not. (default = \code{TRUE})}

\item{plot}{A logical value. If \code{TRUE}, the CDF plot is drawn; if \code{FALSE}, it is not.
See \code{\link[EFAfactors]{plot.CDF}}. (default = \code{TRUE})}
}
\value{
An object of class \code{CDF}, which is a \code{list} containing the following components:
\item{nfact}{The number of factors to be retained.}
\item{RF}{The trained Random Forest model.}
\item{probability}{A 1 × \code{nfact.max} matrix containing the probabilities for factor numbers ranging from 1
                   to \code{nfact.max}, where the value in the f-th column is the probability
                   that the number of factors for the response is f.}
\item{features}{A matrix (1×181) containing all the features for determining the number of
      factors. See \code{\link[EFAfactors]{extractor.feature.FF}}.}
}
\description{
The Comparison Data Forest (CDF; Goretzko & Ruscio, 2024) approach combines Random Forest with the Comparison Data (CD) approach.
}
\details{
The Comparison Data Forest (CDF; Goretzko & Ruscio, 2024) approach combines
Random Forest with the Comparison Data (CD) approach. It uses the method
of Ruscio & Roche (2012) to simulate data with different numbers of factors, then extracts features
from these data to train a Random Forest model. Once trained, the model is used to
predict the number of factors in the empirical data. The algorithm consists of the following steps:

1. **Simulating Data:**

\describe{
   \item{(1)}{For each value of \eqn{nfact} in the range from 1 to \eqn{nfact_{max}},
              generate a population dataset using the \code{\link[EFAfactors]{GenData}} function.}
   \item{(2)}{Each population (\eqn{N_{pop}×I}) is based on \eqn{nfact} factors and consists of \eqn{N_{pop}} observations.}
   \item{(3)}{For each generated population, repeat the following \eqn{N_{sam}} times. For the \eqn{j}-th repetition:
               a. Draw a sample of size \eqn{N×I} from the population, matching the size of the empirical data;
               b. Compute a feature set \eqn{\mathbf{fea}_{nfact,j}}.}
   \item{(4)}{Combine all the generated feature sets \eqn{\mathbf{fea}_{nfact,j}}
              into a data frame \eqn{\mathbf{data}_{train, nfact}}.}
   \item{(5)}{Combine all \eqn{\mathbf{data}_{train, nfact}} into a final data frame, the training dataset \eqn{\mathbf{data}_{train}}.}
}
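
The simulation steps above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the implementation used by the package: \code{simulate_population} and \code{extract_features} are hypothetical stand-ins for \code{\link[EFAfactors]{GenData}} and for the 181-feature extractor, and the sizes are kept deliberately small.

\preformatted{
set.seed(1)
N <- 200; I <- 6                  # assumed size of the empirical data
nfact.max <- 3; N.pop <- 2000     # small values, for illustration only
N.Samples <- 5

# Hypothetical stand-in for GenData(): a simple common-factor population
simulate_population <- function(nfact, N.pop, I) {
  loadings <- matrix(0, I, nfact)
  loadings[cbind(seq_len(I), rep_len(seq_len(nfact), I))] <- 0.7
  scores <- matrix(rnorm(N.pop * nfact), N.pop, nfact)
  noise <- matrix(rnorm(N.pop * I, sd = sqrt(0.51)), N.pop, I)
  tcrossprod(scores, loadings) + noise
}

# Hypothetical stand-in for the feature extractor: eigenvalues only
extract_features <- function(x) eigen(cor(x), only.values = TRUE)$values

# Steps (1)-(5): one block of feature rows per nfact, then stack them
data_train <- do.call(rbind, lapply(seq_len(nfact.max), function(nfact) {
  pop <- simulate_population(nfact, N.pop, I)
  feats <- t(replicate(N.Samples,
                       extract_features(pop[sample(N.pop, N), ])))
  data.frame(nfact = factor(nfact, levels = seq_len(nfact.max)), feats)
}))
dim(data_train)   # nfact.max * N.Samples rows, I + 1 columns
}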

2. **Training the Random Forest:**

   Train a Random Forest model \eqn{RF} using the combined \eqn{\mathbf{data}_{train}}.

3. **Predicting the Number of Factors for the Empirical Data:**

\describe{
   \item{(1)}{Calculate the feature set \eqn{\mathbf{fea}_{emp}} for the empirical data.}
   \item{(2)}{Use the trained Random Forest model \eqn{RF} to predict the number of factors \eqn{nfact_{emp}} for the empirical data:
              \deqn{nfact_{emp} = RF(\mathbf{fea}_{emp})}}
}
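
As a rough sketch of how a predicted probability vector can be turned into a single factor count, assuming the retained number is simply the class with the highest predicted probability (the values below are made up for illustration and are not output from this function):

\preformatted{
# Hypothetical 1 x nfact.max probability matrix (here nfact.max = 4)
probability <- matrix(c(0.05, 0.10, 0.70, 0.15), nrow = 1)

# Column index of the largest probability = predicted number of factors
nfact <- which.max(probability[1, ])
nfact   # 3
}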

According to Goretzko & Ruscio (2024) and Breiman (2001), the recommended number of
trees in the Random Forest, \code{num.trees}, is 500.
The Random Forest in CDF performs a classification task, so the recommended number of
features sampled as split candidates at each node, \code{mtry}, is \eqn{\sqrt{q}} (where \eqn{q} is the number of features),
which gives \eqn{m_{try}=\sqrt{181}\approx 13} after rounding.
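
The arithmetic behind the default \code{mtry} can be checked in base R; the variable name \code{q} is used here only for illustration:

\preformatted{
q <- 181                  # number of features
mtry <- round(sqrt(q))    # sqrt(181) is about 13.45
mtry                      # 13
}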

Because the CDF approach requires extensive data simulation and computation, and is therefore much more time-consuming
than the \code{\link[EFAfactors]{CD}} approach, C++ code is used to speed up the process.
}
\references{
Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5-32. https://doi.org/10.1023/A:1010933404324

Goretzko, D., & Ruscio, J. (2024). The comparison data forest: A new comparison data approach to determine the number of factors in exploratory factor analysis. Behavior Research Methods, 56(3), 1838-1851. https://doi.org/10.3758/s13428-023-02122-4

Ruscio, J., & Roche, B. (2012). Determining the number of factors to retain in an exploratory factor analysis using comparison data of known factorial structure. Psychological Assessment, 24, 282-292. https://doi.org/10.1037/a0025697
}
\seealso{
\code{\link[EFAfactors]{GenData}}
}
\author{
Haijiang Qin <Haijiang133@outlook.com>
}
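\examples{
# A minimal usage sketch, not an official example. The simulated
# two-factor data below are an assumption made for illustration, and the
# reduced nfact.max, N.pop and N.Samples values are chosen only to
# shorten the run time.
\dontrun{
library(EFAfactors)

set.seed(123)
loadings <- matrix(0, 10, 2)
loadings[1:5, 1] <- 0.7
loadings[6:10, 2] <- 0.7
scores <- matrix(rnorm(500 * 2), 500, 2)
noise <- matrix(rnorm(500 * 10, sd = 0.7), 500, 10)
response <- tcrossprod(scores, loadings) + noise

CDF.obj <- CDF(response, nfact.max = 5, N.pop = 5000, N.Samples = 100)

# Components documented in the Value section
CDF.obj$nfact
CDF.obj$probability
}
}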
