% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/check_CH.R
\name{check_CH}
\alias{check_CH}
\title{Check Calinski-Harabasz index}
\usage{
check_CH(
  data,
  sample_id,
  samples_col = "Sample",
  abundance_col = "Abundance",
  range = 2:10,
  with_plot = FALSE,
  ...
)
}
\arguments{
\item{data}{A data.frame with, at least, a column for Abundance and Sample. Additional columns are allowed.}

\item{sample_id}{String with the name of the sample to analyze.}

\item{samples_col}{String with name of column with sample names.}

\item{abundance_col}{String with name of column with abundance values.}

\item{range}{The range of values of k to test, default is from 2 to 10.}

\item{with_plot}{If FALSE (default) returns a vector, but if TRUE will return a plot with the scores.}

\item{...}{Extra arguments.}
}
\value{
Vector or plot with Calinski-Harabasz index for each pre-specified k.
}
\description{
Calculates the Calinski-Harabasz pseudo F-statistic (CH) for a given sample.
}
\details{
CH is an index used to decide the number of clusters in a clustering algorithm.
This function, \code{\link[=check_CH]{check_CH()}}, calculates the CH index for every k in a pre-specified range
of values, thus providing a score for each number of clusters tested (k). The default
range of cluster values (k) is \code{range = 2:10} (see Pascoal et al., 2025, for the reasoning behind this choice).
However, the function can calculate the CH index for any other values of k supplied through \code{range}.

Note that the CH index is not an absolute value that indicates the quality of a single clustering.
Instead, it allows the comparison of clustering results. Thus, if you have several clusterings, the
best one is the one with the highest CH index.

\strong{Data input}

This function takes a data.frame with, at minimum, a column for samples and a column for abundance;
any number of other columns is allowed. It then filters for the specific sample
that you want to analyze. You can also pre-filter for your sample, but you still need to
provide the sample ID (\code{sample_id}), and the table always needs a column for samples and another for abundance
(indicate their names with the arguments \code{samples_col} and \code{abundance_col}).

\strong{Output options}

The default option returns a vector with the CH score for each k. This is a simple output that can then be used
in other analyses. However, we also provide the option to show a plot (set \code{with_plot = TRUE}) with
the CH score for each k.

\strong{Explanation of Calinski-Harabasz index}

The CH index is a \strong{variance ratio criterion}: it measures both the \strong{separation} and the \strong{density} of the clusters.
Higher values are better, because they mean that points within the same cluster are close to each other and
that different clusters are well separated.

The CH index can be seen as:

\deqn{CH = \frac{\text{inter cluster dispersion}}{\text{intra cluster dispersion}}}

To calculate inter-cluster dispersion:

Let \eqn{k} be the number of clusters and \eqn{BGSS} be the between-group sum of squares.

The inter-cluster dispersion is \deqn{\frac{BGSS}{(k-1)}}

To calculate BGSS:

Let \eqn{n_k} be the number of observations in a cluster,
\eqn{C} be the centroid of the dataset (barycenter) and \eqn{C_k} the centroid of
a cluster,

\deqn{BGSS = \sum_{k = 1}^{k}{n_k * \left\lvert C_k-C \right\rvert^2}}

Thus, for each cluster, the BGSS takes the squared distance between the cluster centroid and
the centroid of the whole dataset, multiplies it by the number of observations in that cluster,
and sums the result over all clusters.

To calculate intra-cluster dispersion:

Let \eqn{WGSS} be the Within Group Sum of Squares and \eqn{N} be the total
number of observations in the dataset.

The intra-cluster dispersion is

\deqn{\frac{WGSS}{(N-k)}}

Let \eqn{X_{ik}} be the i'th observation of cluster k and
\eqn{n_k} be the number of observations in that cluster.

\deqn{WGSS = \sum_{k=1}^{k}\sum_{i=1}^{n_k}\left\lvert X_{ik} - C_k \right\rvert^2}

Thus, WGSS sums the squared distances between observations and their cluster centroid; dividing it by
\eqn{N-k} gives a sense of the dispersion within clusters.

Finally, the CH index can be given by:

\deqn{CH = \frac{\sum_{k = 1}^{k}{n_k * \left\lvert C_k-C \right\rvert^2}}
 {\sum_{k=1}^{k}\sum_{i=1}^{n_k}\left\lvert X_{ik} - C_k \right\rvert^2}
 \frac{(N-k)}{(k-1)}}
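
As an illustrative sketch with made-up numbers (not taken from the package data), consider
\eqn{N = 4} abundance values, 1, 2, 9 and 10, split into \eqn{k = 2} clusters: the first holding
1 and 2, the second holding 9 and 10. The centroid of the dataset is \eqn{C = 5.5} and the cluster
centroids are \eqn{C_1 = 1.5} and \eqn{C_2 = 9.5}, so that:

\deqn{BGSS = 2 * (1.5 - 5.5)^2 + 2 * (9.5 - 5.5)^2 = 64}

\deqn{WGSS = (1 - 1.5)^2 + (2 - 1.5)^2 + (9 - 9.5)^2 + (10 - 9.5)^2 = 1}

\deqn{CH = \frac{64}{1} \frac{(4-2)}{(2-1)} = 128}

Any other split of these four values into two clusters would give a lower score, which is why
the index is read comparatively.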
}
\examples{
library(dplyr)
# Just scores
check_CH(nice_tidy, sample_id = "ERR2044662")

# To change range
check_CH(nice_tidy, sample_id = "ERR2044662", range = 4:11)
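
# To pick the k with the highest score (a sketch; it assumes the returned
# scores are in the same order as the values given in `range`)
k_range <- 2:10
ch_scores <- check_CH(nice_tidy, sample_id = "ERR2044662", range = k_range)
k_range[which.max(ch_scores)]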

# To see a simple plot
check_CH(nice_tidy, sample_id = "ERR2044662", range = 4:11, with_plot = TRUE)
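
# A minimal by-hand sketch of the CH formula from the Details section, using
# base R only. Assumption: `ch_by_hand()` is a hypothetical helper written for
# this example; it is not part of the package and is not necessarily how
# check_CH() computes its scores internally.
ch_by_hand <- function(x, cluster) {
  N <- length(x)                          # total number of observations
  k <- length(unique(cluster))            # number of clusters
  centroids <- tapply(x, cluster, mean)   # C_k, one centroid per cluster
  sizes <- tapply(x, cluster, length)     # n_k, observations per cluster
  # Between-group sum of squares: size-weighted squared centroid distances
  bgss <- sum(sizes * (centroids - mean(x))^2)
  # Within-group sum of squares: squared distances to own cluster centroid
  wgss <- sum((x - centroids[as.character(cluster)])^2)
  (bgss / (k - 1)) / (wgss / (N - k))
}
# Toy data (made up): two tight groups that are far apart; returns 128
ch_by_hand(x = c(1, 2, 9, 10), cluster = c(1, 1, 2, 2))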


}
\references{
Calinski, T., & Harabasz, J. (1974). A dendrite method for cluster analysis. Communications in Statistics - Theory and Methods, 3(1), 1–27.

Pascoal, F., Branco, P., Torgo, L. et al. Definition of the microbial rare biosphere through unsupervised machine learning. Commun Biol 8, 544 (2025). https://doi.org/10.1038/s42003-025-07912-4
}
\seealso{
\link[clusterSim:index.G1]{clusterSim::index.G1}
}
