% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/molting.R
\name{molting}
\alias{molting}
\title{Molt: De-identify a Dataset with Hash-based Relinking}
\usage{
molting(
  data,
  id_cols = NULL,
  pii_patterns = NULL,
  additional_pii_cols = NULL,
  hash_method = "sha256",
  hash_col_name = "row_hash",
  return_lookup = TRUE,
  seed = NULL
)
}
\arguments{
\item{data}{A data frame to be de-identified.}

\item{id_cols}{An optional character vector of column names to use for
creating the hash. If NULL (the default), the function will use the
PII columns it automatically detects.}

\item{pii_patterns}{An optional character vector of regular expression
patterns used to detect PII columns for removal. The default list
includes common identifiers.}

\item{additional_pii_cols}{An optional character vector of specific column
names to remove as PII, in addition to those detected by pattern matching.
Useful for adding dataset-specific identifiers without modifying patterns.}

\item{hash_method}{The hashing algorithm to use. Options include
\code{"sha256"} (default), \code{"md5"}, \code{"sha1"}, \code{"sha512"},
\code{"crc32"}, \code{"xxhash32"}, \code{"xxhash64"}, \code{"murmur32"},
\code{"spookyhash"}, or \code{"blake3"}. See \code{?digest::digest} for
details.}

\item{hash_col_name}{A string for the name of the new hash column.
Defaults to "row_hash".}

\item{return_lookup}{Logical. If TRUE (default), returns a list containing
both the de-identified data and a lookup table. If FALSE, returns only
the de-identified data frame.}

\item{seed}{An optional integer seed for reproducible hashing with certain
algorithms. Defaults to NULL.}
}
\value{
If \code{return_lookup = TRUE} (the default), a list with two elements:
  \itemize{
    \item \code{deidentified}: The de-identified data frame, including the
          hash column.
    \item \code{lookup}: A data frame containing only the identifier columns
          and the hash column, for relinking.
  }
  If \code{return_lookup = FALSE}, only the de-identified data frame.
}
\description{
Like a bird molting its feathers for new plumage, this function removes
identifiable information from a data frame and replaces it with a hash
computed for each row. By default, it returns both the de-identified dataset
and a lookup table for relinking. Age category variables (age2cat, age3cat,
etc.) are automatically retained.
}
\details{
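The function identifies PII columns based on pattern matching, creates a
hash for each row from the concatenated identifier values, and, when
\code{return_lookup = TRUE}, returns a lookup table alongside the
de-identified dataset. Because the hash is derived from the identifier
values, rows that share identical identifier values should be expected to
share a hash.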

Age category variables (variables matching the pattern "age\\d+cat" such as
age2cat, age5cat, age10cat, etc.) are automatically retained in the
de-identified dataset as they are not considered directly identifying.

Security Note: The lookup table contains the original identifiers and can be
used to re-identify every row. Store it separately from the de-identified
data, with appropriate access controls, and consider encrypting it if you
write it to disk.
}
\examples{
# Create sample data
patient_data <- data.frame(
  patient_name = c("John Doe", "Jane Smith"),
  dob = as.Date(c("1980-01-01", "1975-05-15")),
  mrn = c("12345", "67890"),
  age5cat = factor(c("18-64", "18-64")),
  diagnosis = c("Condition A", "Condition B"),
  lab_value = c(120, 95)
)
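
# Illustrative check of which columns look like age categories. The
# documented pattern is "age" followed by digits and "cat"; the character
# class [0-9] is used here as an equivalent of the digit shorthand. This only
# demonstrates the pattern with base grep() and is not a claim about how
# molting() applies it internally.
age_cols <- grep("age[0-9]+cat", names(patient_data), value = TRUE)
age_cols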

# Basic de-identification (age categories automatically retained)
result <- suppressMessages(molting(patient_data))
names(result$deidentified)  # Check column names
head(result$deidentified, 2)  # View de-identified data
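
# Relink de-identified rows to their identifiers through the lookup table.
# A sketch only: it assumes the lookup table stores the hash under the
# default hash_col_name ("row_hash"), as the Value section suggests.
relinked <- merge(result$deidentified, result$lookup, by = "row_hash")
names(relinked)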

# Use different hash method
result_md5 <- suppressMessages(
  molting(patient_data, hash_method = "md5")
)

# Return only de-identified data (no lookup table)
deidentified_only <- suppressMessages(
  molting(patient_data, return_lookup = FALSE)
)

# Remove an additional, dataset-specific column as PII
result_custom <- suppressMessages(
  molting(patient_data, additional_pii_cols = "diagnosis")
)

# Specify custom identifier columns for hashing
result_ids <- suppressMessages(
  molting(patient_data, id_cols = c("mrn", "dob"))
)
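
# Illustrative sketch of the row-hash idea described in Details: concatenate
# the identifier values of each row and hash the resulting string. This is an
# assumption about the general technique, not the internals of molting();
# the "|" separator and serialize = FALSE are choices made for this example.
if (requireNamespace("digest", quietly = TRUE)) {
  id_values <- lapply(patient_data[c("mrn", "dob")], as.character)
  row_keys <- do.call(paste, c(unname(id_values), sep = "|"))
  sketch_hash <- vapply(
    row_keys,
    function(key) digest::digest(key, algo = "sha256", serialize = FALSE),
    character(1),
    USE.NAMES = FALSE
  )
  nchar(sketch_hash)  # SHA-256 hex digests are 64 characters long
}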

}
