% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/DistToNearest.R
\name{distToNearest}
\alias{distToNearest}
\title{Distance to nearest neighbor}
\usage{
distToNearest(
  db,
  sequenceColumn = "JUNCTION",
  vCallColumn = "V_CALL",
  jCallColumn = "J_CALL",
  model = c("ham", "aa", "hh_s1f", "hh_s5f", "mk_rs1nf", "mk_rs5nf", "m1n_compat",
    "hs1f_compat"),
  normalize = c("len", "none"),
  symmetry = c("avg", "min"),
  first = TRUE,
  VJthenLen = TRUE,
  nproc = 1,
  fields = NULL,
  cross = NULL,
  mst = FALSE,
  subsample = NULL,
  progress = FALSE,
  cellIdColumn = NULL,
  locusColumn = NULL,
  groupUsingOnlyIGH = TRUE,
  keepVJLgroup = TRUE
)
}
\arguments{
\item{db}{data.frame containing sequence data.}

\item{sequenceColumn}{name of the column containing the junction for grouping and for calculating
nearest neighbot distances. Note that while both heavy and light chain junctions
may be used for VJL grouping, only the heavy chain junction is used to calculate 
distances.}

\item{vCallColumn}{name of the column containing the V-segment allele calls.}

\item{jCallColumn}{name of the column containing the J-segment allele calls.}

\item{model}{underlying SHM model, which must be one of 
\code{c("ham", "aa", "hh_s1f", "hh_s5f", "mk_rs1nf", "hs1f_compat", "m1n_compat")}.
See Details for further information.}

\item{normalize}{method of normalization. The default is \code{"len"}, which 
divides the distance by the length of the sequence group. If 
\code{"none"} then no normalization if performed.}

\item{symmetry}{if model is hs5f, distance between seq1 and seq2 is either the
average (avg) of seq1->seq2 and seq2->seq1 or the minimum (min).}

\item{first}{if \code{TRUE} only the first call of the gene assignments 
is used. if \code{FALSE} the union of ambiguous gene 
assignments is used to group all sequences with any 
overlapping gene calls.}

\item{VJthenLen}{a Boolean value specifying whether to perform partitioning as a 2-stage
process. If \code{TRUE}, partitions are made first based on V and J
annotations, and then further split based on junction lengths corresponding 
to \code{sequenceColumn}. If \code{FALSE}, perform partition as a 1-stage 
process during which V annotation, J annotation, and junction length are used 
to create partitions simultaneously. Defaults to \code{TRUE}.}

\item{nproc}{number of cores to distribute the function over.}

\item{fields}{additional fields to use for grouping.}

\item{cross}{character vector of column names to use for grouping to calculate 
distances across groups. Meaning the columns that define self versus others.}

\item{mst}{if \code{TRUE}, return comma-separated branch lengths from minimum 
spanning tree.}

\item{subsample}{number of sequences to subsample for speeding up pairwise-distance-matrix calculation. 
Subsampling is performed without replacement in each VJL group of heavy chain sequences. 
If \code{subsample} is larger than the unique number of heavy chain sequences in each 
VJL group, then the subsampling process is ignored for that group. For each heavy chain
sequence in \code{db}, the reported \code{DIST_NEAREST} is the distance to the closest
heavy chain sequence in the subsampled set for the VJL group. If \code{NULL} no 
subsampling is performed.}

\item{progress}{if \code{TRUE} print a progress bar.}

\item{cellIdColumn}{name of the column containing cell IDs. Only applicable and required for 
single-cell mode.}

\item{locusColumn}{name of the column containing locus information. Only applicable and 
required for single-cell mode.}

\item{groupUsingOnlyIGH}{use only heavy chain (\code{IGH}) sequences for VJL grouping, disregarding 
light chains. Only applicable and required for single-cell mode. 
Default is \code{TRUE}. Also see \link[alakazam]{groupGenes}.}

\item{keepVJLgroup}{a Boolean value specifying whether to keep in the output the the column 
column indicating grouping based on VJL combinations. Only applicable for
1-stage partitioning (i.e. \code{VJthenLen=FALSE}). Also see 
\link[alakazam]{groupGenes}.}
}
\value{
Returns a modified \code{db} data.frame with nearest neighbor distances between heavy chain
          sequences in the \code{DIST_NEAREST} column if \code{cross=NULL}. If \code{cross} was 
          specified, distances will be added as the \code{CROSS_DIST_NEAREST} column. 
          
          Note that distances between light chain sequences are not calculated, even if light chains 
          were used for VJL grouping via \code{groupUsingOnlyIGH=FALSE}. Light chain sequences, if any,
          will have \code{NA} in the \code{DIST_NEAREST} field.
}
\description{
Get non-zero distance of every heavy chain (\code{IGH}) sequence (as defined by 
\code{sequenceColumn}) to its nearest sequence in a partition of heavy chains sharing the same 
V gene, J gene, and junction length (VJL), or in a partition of single cells with heavy chains
sharing the same heavy chain VJL combination, or of single cells with heavy and light chains 
sharing the same heavy chain VJL and light chain VJL combinations.
}
\details{
To invoke single-cell mode, both \code{cellIdColumn} and \code{locusColumn} must be supplied. 
Otherwise, the function will run under non-single-cell mode.

Under single-cell mode, only heavy chain sequences will be used for calculating nearest neighbor
distances. Under non-single-cell mode, all input sequences will be used for calculating nearest
neighbor distances, regardless of the values in the \code{locusColumn} field (if present).

For single-cell mode, the input format is the same as that for \link[alakazam]{groupGenes}. 
Namely, each row represents a sequence/chain. Sequences/chains from the same cell are linked
by a cell ID in the \code{cellIdColumn} field. Under this mode, there is a choice of whether 
grouping should be done using only heavy chain (\code{IGH}) sequences only, or using both 
heavy chain (\code{IGH}) and light chain (\code{IGK}, \code{IGL}) sequences. This is governed 
by \code{groupUsingOnlyIGH}.

If used, values in the \code{locusColumn} column must be one of \code{"IGH"}, \code{"IGK"}, and \code{"IGL"}.

Note that for \code{distToNearest}, a cell with multiple heavy chains is not allowed.

The distance to nearest (heavy chain) neighbor can be used to estimate a threshold for assigning 
Ig sequences to clonal groups. A histogram of the resulting vector is often bimodal, with the 
ideal threshold being a value that separates the two modes.

The following distance measures are accepted by the \code{model} parameter.

\itemize{
  \item \code{"ham"}:          Single nucleotide Hamming distance matrix from \link[alakazam]{getDNAMatrix} 
                               with gaps assigned zero distance.
  \item \code{"aa"}:           Single amino acid Hamming distance matrix from \link[alakazam]{getAAMatrix}.
  \item \code{"hh_s1f"}:       Human single nucleotide distance matrix derived from \link{HH_S1F} with 
                               \link{calcTargetingDistance}.
  \item \code{"hh_s5f"}:       Human 5-mer nucleotide context distance matix derived from \link{HH_S5F} with 
                               \link{calcTargetingDistance}.
  \item \code{"mk_rs1nf"}:     Mouse single nucleotide distance matrix derived from \link{MK_RS1NF} with 
                               \link{calcTargetingDistance}.
  \item \code{"mk_rs5nf"}:     Mouse 5-mer nucleotide context distance matrix derived from \link{MK_RS1NF} with 
                               \link{calcTargetingDistance}.
  \item \code{"hs1f_compat"}:  Backwards compatible human single nucleotide distance matrix used in 
                               SHazaM v0.1.4 and Change-O v0.3.3.
  \item \code{"m1n_compat"}:   Backwards compatibley mouse single nucleotide distance matrix used in 
                               SHazaM v0.1.4 and Change-O v0.3.3.
}

Note on \code{NA}s: if, for a given combination of V gene, J gene, and sequence length,
there is only 1  heavy chain sequence (as defined by \code{sequenceColumn}), \code{NA} is 
returned instead of a distance (since it has no heavy chain neighbor). If for a given combination 
there are multiple heavy chain sequences but only 1 unique one, (in which case every heavy cahin 
sequence in this group is the de facto nearest neighbor to each other, thus giving rise to distances 
of 0), \code{NA}s are returned instead of zero-distances.

Note on \code{subsample}: Subsampling is performed independently in each VJL group for heavy chain
sequences. If \code{subsample} is larger than number of heavy chain sequences in the group, it is 
ignored. In other words, subsampling is performed only on groups in which the number of heavy chain 
sequences is equal to or greater than \code{subsample}. \code{DIST_NEAREST} has values calculated 
using all heavy chain sequences in the group for groups with fewer than \code{subsample} heavy chain
sequences, and values calculated using a subset of heavy chain sequences for the larger groups. 
To select a value of \code{subsample}, it can be useful to explore the group sizes in \code{db} 
(and the number of heavy chain sequences in those groups).
}
\examples{
# Subset example data to one sample as a demo
data(ExampleDb, package="alakazam")
db <- subset(ExampleDb, SAMPLE == "-1h")

# Use genotyped V assignments, Hamming distance, and normalize by junction length
# First partition based on V and J assignments, then by junction length
# Take into consideration ambiguous V and J annotations
dist <- distToNearest(db, vCallColumn="V_CALL_GENOTYPED", model="ham", 
                      first=FALSE, VJthenLen=TRUE, normalize="len")
                           
# Plot histogram of non-NA distances
p1 <- ggplot(data=subset(dist, !is.na(DIST_NEAREST))) + 
      theme_bw() + 
      ggtitle("Distance to nearest: Hamming") + 
      xlab("distance") +
      geom_histogram(aes(x=DIST_NEAREST), binwidth=0.025, 
                     fill="steelblue", color="white")
plot(p1)

}
\references{
\enumerate{
  \item  Smith DS, et al. Di- and trinucleotide target preferences of somatic 
           mutagenesis in normal and autoreactive B cells. 
           J Immunol. 1996 156:2642-52. 
  \item  Glanville J, Kuo TC, von Budingen H-C, et al. 
           Naive antibody gene-segment frequencies are heritable and unaltered by 
           chronic lymphocyte ablation. 
           Proc Natl Acad Sci USA. 2011 108(50):20066-71.
  \item  Yaari G, et al. Models of somatic hypermutation targeting and substitution based 
           on synonymous mutations from high-throughput immunoglobulin sequencing data. 
           Front Immunol. 2013 4:358.
 }
}
\seealso{
See \link{calcTargetingDistance} for generating nucleotide distance matrices 
          from a \link{TargetingModel} object. See \link{HH_S5F}, \link{HH_S1F}, 
          \link{MK_RS1NF}, \link[alakazam]{getDNAMatrix}, and \link[alakazam]{getAAMatrix}
          for individual model details.
}
