% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/check_taxonomy.R
\name{check_taxonomy}
\alias{check_taxonomy}
\title{check_taxonomy}
\usage{
check_taxonomy(
  x,
  ranks = c("phylum", "class", "order", "family", "genus"),
  species = FALSE,
  species_sep = NULL,
  routine = c("format_check", "spell_check", "discrete_ranks", "find_duplicates"),
  report = TRUE,
  verbose = TRUE,
  clean_name = FALSE,
  clean_spell = FALSE,
  thresh = NULL,
  resolve_duplicates = FALSE,
  append = TRUE,
  term_set = NULL,
  collapse_set = NULL,
  jw = 0.1,
  str = 1,
  str2 = NULL,
  alternative = "jaccard",
  q = 1,
  pref_set = NULL,
  suff_set = NULL,
  exclude_set = NULL,
  jump = 3,
  plot = FALSE
)
}
\arguments{
\item{x}{A data frame with hierarchically organised
taxonomic information. If x comprises only the taxonomic
information, \code{ranks} does not need to be specified, but the
columns must be in order of decreasing taxonomic rank.}

\item{ranks}{The column names of the taxonomic data fields
in x. These must be provided in order of decreasing taxonomic
rank}

\item{species}{A logical indicating if x contains a species
column. As the data must be supplied in hierarchical order,
this column will naturally be the last column in x and
species-specific spell checks will be performed on this column.
NOTE that for the function to work, the species name must be the
full species name rather than just the specific epithet, e.g.,
'Tyto_alba' rather than just 'alba'.}

\item{species_sep}{A character vector of length one specifying
the separator between the genus name and specific epithet in the
species column, e.g., '_' for 'Tyto_alba'.}

\item{routine}{A character vector determining the flagging
and cleaning routines to employ. Valid values are format_check (check for
non-letter characters and the number of words in names), spell_check
(flag potential spelling errors), discrete_ranks (check that
taxonomic names are unique to their rank) and find_duplicates (flag
conflicting higher classifications of a given taxon).}

\item{report}{A logical of length one determining if the flagging outputs
of each cleaning routine should be returned to the user for inspection.
This is different to \code{verbose}, which controls whether flagging
should additionally be reported to the user on the console.}

\item{verbose}{A logical determining if function progress and flagged
errors should be reported to the console.}

\item{clean_name}{If TRUE, the function will return cleaned versions of
the columns in x using the routines in \code{clean_name}. These routines
can be altered using the \code{term_set} and \code{collapse_set} arguments.}

\item{clean_spell}{If TRUE, the function will return a cleaned version
of the supplied taxonomic data frame in which each pair of flagged
synonyms is automatically updated to the more frequent spelling, using
the threshold \code{thresh} for the similarity method given by
\code{alternative}. This is not recommended, however, so the argument is
FALSE by default and the threshold left as NULL.}

\item{thresh}{The threshold for the similarity method given by
\code{alternative}, below which flagged pairs of names will be considered
synonyms and resolved automatically. See \code{spell_check} for details
of this method.}

\item{resolve_duplicates}{If TRUE, the function will return a cleaned version
of the supplied taxonomic data frame, using \code{resolve_duplicates}
to resolve conflicts in the way documented by that function.
\code{clean_spell} and \code{resolve_duplicates} can both be TRUE to return
a dataset cleaned by both methods.}

\item{append}{If TRUE, any suffixes used during cleaning will
be retained in the cleaned version of the data. This is preferable
as it ensures that all taxonomic names are rank-discrete and
uniquely classified.}

\item{term_set}{A character vector of terms (to be
used at all ranks) or a list of rank-specific terms,
which will be supplied element-wise to \code{clean_name}.
If a list, this should be given in descending rank order.}

\item{collapse_set}{A character vector of character strings (to be
used at all ranks) or a list of rank-specific strings,
which will be supplied element-wise as the \code{collapse} argument
of \code{clean_name}. If a list, this
should be given in descending rank order.}

\item{jw}{Passed to \code{spell_check}.}

\item{str}{Passed to \code{spell_check}.}

\item{str2}{Passed to \code{spell_check}.}

\item{alternative}{Passed to \code{spell_check}.}

\item{q}{Passed to \code{spell_check}.}

\item{pref_set}{A character vector of prefixes (which
will be used at all ranks) or a list of rank-specific prefixes,
which will be supplied element-wise as the \code{pref} argument
of \code{spell_check}. If a list, this
should be given in descending rank order.}

\item{suff_set}{A character vector of suffixes (which
will be used at all ranks) or a list of rank-specific suffixes,
which will be supplied element-wise as the \code{suff} argument
of \code{spell_check}. If a list, this
should be given in descending rank order.}

\item{exclude_set}{A character vector of terms to exclude (which
will be used at all ranks) or a list of rank-specific exclusion terms,
which will be supplied element-wise as the \code{exclude} argument
of \code{spell_check}. If a list, this
should be given in descending rank order.}

\item{jump}{Passed to \code{resolve_duplicates}.}

\item{plot}{Passed to \code{resolve_duplicates}.}
}
\value{
A list with elements corresponding to the outputs of the chosen
flagging routines (four by default: $formatting, $synonyms, $ranks, $duplicates),
plus a cleaned version of the data ($data) if any of clean_name, clean_spell
or resolve_duplicates are TRUE. See \code{format_check}, \code{spell_check},
\code{discrete_ranks} and \code{find_duplicates} for details of the structure
of the flagging outputs.
}
\description{
Wrapper function implementing a multi-step cleaning routine
for hierarchically structured taxonomic data. The first part
of the routine calls \code{format_check} to perform a few
presumptive checks on all columns, scanning for non-letter
characters and checking the number of words in each string.
If chosen, \code{clean_name} is called to ensure correct
formatting, as this improves downstream checking. The second
part of the routine calls \code{spell_check} to flag spelling
discrepancies between names within a given taxonomic group. If
chosen, the function can automatically impose the more frequent
spelling. The third part of the routine calls \code{discrete_ranks}
to flag name re-use at different taxonomic levels. Some of these
cases may arise when a name has been unfortunately (although
permissibly) used to refer to groups at different taxonomic levels,
or where a higher classification has been inserted as a
placeholder for a missing lower classification. The fourth part of
the routine calls \code{find_duplicates} to flag variable higher
classifications for a given taxon, including cases where a higher
classification is missing for one instance of a taxon but present
for the others. If chosen, \code{resolve_duplicates} is called
to ensure a consistent classification is imposed. For cases where
a name has been re-used at the same rank for genuinely different
taxa (not permissible, unlike name re-use at different ranks),
suffixes are added as capital letters, e.g. TaxonA, TaxonB. If
any of the automatic cleaning routines are employed (all are
FALSE by default), the function will also return a cleaned version
of the dataset. If the use of suffixes from \code{resolve_duplicates}
is not desirable, set \code{append = FALSE} so that any suffixes are
dropped before returning.
}
\examples{
# load dataset
data("brachios")
# subsample brachios to make for a short example runtime
set.seed(1)
brachios <- brachios[sample(1:nrow(brachios), 1000),]
# define the taxonomic ranks used in the dataset (re-used elsewhere)
b_ranks <- c("phylum", "class", "order", "family", "genus")
# define a list of suffixes to be used at each taxonomic level when scanning for synonyms
b_suff <- list(NULL, NULL, NULL, NULL, c("ina", "ella", "etta"))
# scan for errors, keeping the flags separate from the input data
flags <- check_taxonomy(brachios, suff_set = b_suff, ranks = b_ranks)
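# A further sketch, not taken from the package authors: run only two of the
# routines and request automatic resolution of conflicting classifications.
# Only arguments shown in the usage above are used, and nothing is assumed
# about the returned elements beyond the names given in the Value section.
# The block is marked as not run because it has not been checked against
# the package.
\dontrun{
# reload and subsample the data so the sketch does not depend on earlier objects
data("brachios")
set.seed(1)
b_sub <- brachios[sample(seq_len(nrow(brachios)), 1000), ]
res <- check_taxonomy(
  b_sub,
  ranks = c("phylum", "class", "order", "family", "genus"),
  routine = c("discrete_ranks", "find_duplicates"),
  resolve_duplicates = TRUE,
  verbose = FALSE
)
# element names of the returned list
names(res)
# cleaned data, returned because resolve_duplicates = TRUE
head(res$data)
}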
}
\seealso{
\code{format_check}, \code{spell_check}, \code{discrete_ranks} and
\code{find_duplicates} for details of the structure of the flagging outputs
}
