% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/LncFinder.R
\name{extract_features}
\alias{extract_features}
\title{Extract the Features}
\usage{
extract_features(
  Sequences,
  label = NULL,
  SS.features = FALSE,
  format = "DNA",
  frequencies.file = "human",
  parallel.cores = 2
)
}
\arguments{
\item{Sequences}{mRNA sequences or long non-coding sequences. Can be a FASTA
file loaded by \code{\link[seqinr]{seqinr-package}} or
secondary structure sequences (Dot-Bracket Notation) obtained from function
\code{\link{run_RNAfold}}. If \code{Sequences} contains secondary structure
sequences, parameter \code{format} should be set to \code{"SS"}.}

\item{label}{Optional. String. The label of the sequences, such as
"NonCoding" or "Coding".}

\item{SS.features}{Logical. If \code{SS.features = TRUE}, secondary structure
features will be extracted. In this case, \code{Sequences} should be secondary
structure sequences (Dot-Bracket Notation) obtained from function
\code{\link{run_RNAfold}} and parameter \code{format} should be set as \code{"SS"}.}

\item{format}{String. Can be \code{"DNA"} or \code{"SS"}. Defines the format of
\code{Sequences}: \code{"DNA"} for DNA sequences and \code{"SS"} for secondary
structure sequences. This parameter must be set to \code{"SS"} when
\code{SS.features = TRUE}.}

\item{frequencies.file}{String or a list obtained from function
\code{\link{make_frequencies}}. Input species name \code{"human"}, \code{"mouse"}
or \code{"wheat"} to use pre-built frequencies files, or assign the user's own
frequencies file (see function \code{\link{make_frequencies}}).}

\item{parallel.cores}{Integer. The number of cores for parallel computation.
The default is \code{2}. Set to \code{-1} to run
this function with all cores.}
}
\value{
Returns a data.frame with 11 features when \code{SS.features} is \code{FALSE},
and 19 features when \code{SS.features} is \code{TRUE}.
}
\description{
This function extracts features from the sequences and constructs the dataset.
It only extracts the features; please use function \code{\link{build_model}} to
build new models.
}
\details{
This function extracts the features and constructs the dataset.

Because obtaining secondary structure sequences is time-consuming,
users can build the model with only the sequence and EIIP features
(\code{SS.features = FALSE}). When \code{SS.features = TRUE}, \code{Sequences}
should be secondary structure sequences (Dot-Bracket Notation) obtained from
function \code{\link{run_RNAfold}} and parameter \code{format} should be set
as \code{"SS"}.

Please note that:

Secondary structure features (\code{SS.features}) can improve the performance
when the species of the unevaluated sequences is identical to the species of the
sequences used to build the model.

However, if users are trying to predict sequences with a model trained on
another species, setting \code{SS.features} to \code{TRUE} may lead to low accuracy.
}
\section{Features}{

1. Features based on sequence:

   The length and coverage of the longest ORF (\code{ORF.Max.Len} and
   \code{ORF.Max.Cov});

   Log-Distance.lncRNA (\code{Seq.lnc.Dist});

   Log-Distance.protein-coding transcripts (\code{Seq.pct.Dist});

   Distance-Ratio.sequence (\code{Seq.Dist.Ratio}).


2. Features based on EIIP (electron-ion interaction pseudopotential) value:

   Signal at 1/3 position (\code{Signal.Peak});

   Signal to noise ratio (\code{SNR});

   The minimum value of the top 10\% power spectrum (\code{Signal.Min});

   The quantiles Q1 and Q2 of the top 10\% power spectrum (\code{Signal.Q1}
   and \code{Signal.Q2});

   The maximum value of the top 10\% power spectrum (\code{Signal.Max}).


3. Features based on secondary structure sequence:

   Log-Distance.acguD.lncRNA (\code{Dot_lnc.dist});

   Log-Distance.acguD.protein-coding transcripts (\code{Dot_pct.dist});

   Distance-Ratio.acguD (\code{Dot_Dist.Ratio});

   Log-Distance.acgu-ACGU.lncRNA (\code{SS.lnc.dist});

   Log-Distance.acgu-ACGU.protein-coding transcripts (\code{SS.pct.dist});

   Distance-Ratio.acgu-ACGU (\code{SS.Dist.Ratio});

   Minimum free energy (\code{MFE});

   Percentage of Unpair-Pair (\code{UP.PCT}).
}

\section{References}{

Siyu Han, Yanchun Liang, Qin Ma, Yangyi Xu, Yu Zhang, Wei Du, Cankun Wang & Ying Li.
LncFinder: an integrated platform for long non-coding RNA identification utilizing
sequence intrinsic composition, structural information, and physicochemical property.
\emph{Briefings in Bioinformatics}, 2019, 20(6):2009-2027.
}

\examples{
\dontrun{
data(demo_DNA.seq)
Seqs <- demo_DNA.seq

### Extract features with pre-built frequencies.file:
my_features <- extract_features(Seqs, label = "Class.of.the.Sequences",
                                SS.features = FALSE, format = "DNA",
                                frequencies.file = "mouse",
                                parallel.cores = 2)

### Use your own frequencies file by assigning a frequencies list to parameter
### "frequencies.file".
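### A minimal sketch of the step described above, under stated assumptions:
### "my_frequencies" is a hypothetical name for a list previously returned by
### make_frequencies() (see ?make_frequencies for its arguments). It is not
### created here, so the call is guarded and skipped when the object is absent.
if (exists("my_frequencies")) {
  my_features_custom <- extract_features(Seqs, label = "Class.of.the.Sequences",
                                         SS.features = FALSE, format = "DNA",
                                         frequencies.file = my_frequencies,
                                         parallel.cores = 2)
}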
}
}
\seealso{
\code{\link{svm_tune}}, \code{\link{build_model}},
         \code{\link{make_frequencies}}, \code{\link{run_RNAfold}}, \code{\link{read_SS}}.
}
\author{
HAN Siyu
}
