\name{MS.DataCreation}
\alias{MS.DataCreation}

\title{
Create an initial data matrix from GC-MS analyses by collecting and assembling the information from chromatograms and mass spectra
}
\description{
This function constructs an initial data matrix by collecting and assembling the information from the chromatograms and mass spectra of several GC-MS analyses. It performs peak detection if the input files are in ASCII format. For all input files, peak retention times (or retention indices) are retrieved from the chromatograms and associated with their respective mass spectra. Each row of the output data matrix represents one peak in one analysis and gives the sample name in the first column, the peak retention time (or retention index) in the second column and the mass spectrum of the peak in the following columns. If the input files are in Agilent format, quantification information can be added by reporting the corrected area and the percent of the total corrected area.
}
\usage{
MS.DataCreation(path, mz, DataType, N_filt, apex, quant = FALSE)
}

\arguments{
  \item{path}{
Name of the folder containing all the GC-MS analyses
}
  \item{mz}{
Range of mass fragments delimiting the mass spectrum, e.g. 30:250
}
  \item{DataType}{
Indicates the type of input files: \emph{Agilent} when sample folders are obtained with Agilent Technologies machines (extension .D) or \emph{ASCII} when sample folders contain files as returned by trans.ASCII
}
  \item{N_filt}{
When the \emph{ASCII} data type is selected, N_filt must be supplied; it controls chromatogram smoothing before peak detection. For more details about smoothing, please refer to the documentation of the function \emph{filter} with method=\emph{convolution}. If N_filt is lower than 3, the profile is not smoothed. A high N_filt lowers the noise in the chromatogram but can result in the loss of low-concentration peaks
}
  \item{apex}{
\code{TRUE} indicates that the mass spectrum is taken at the apex of the peak. \code{FALSE} indicates that a mean mass spectrum is computed: for Agilent files, by averaging 5 percent of the mass spectra surrounding the apex (apex included); for ASCII files, by averaging the mass spectrum before the apex, the mass spectrum at the apex and the mass spectrum after the apex
}
  \item{quant}{
If DataType=\emph{Agilent}, the option quant indicates whether quantification information should be extracted from rteres.txt and added to the initial data matrix. \code{TRUE} indicates that the two quantification columns corr.area (corrected peak area) and \% of total (percent of the total corrected area) are extracted from rteres.txt and added to the initial data matrix after the retention time (or retention index) column. Corrected area is used for absolute quantification in combination with external and/or internal standards. Percent of the total corrected area is used for relative quantification (no external or internal standard needed). This choice makes it possible to generate a profiling matrix with quantification of each molecule after MS.clust. \code{FALSE} indicates that the quantification information should not be added to the initial data matrix. In that case, a fingerprinting matrix (absence or presence of each molecule) will be obtained after MS.clust.
}
}
\details{
After a GC-MS analysis, a folder is created and contains different files from the chromatograph and from the mass spectrometer. The input files in the sample folder can be of different origins: 
	 
	(i) For Agilent Technologies instruments (using the default parameters): each analysis returns a .D folder that contains a file rteres.txt with summary information about the chromatogram. A second file (with the mass spectra information) is needed and can be generated by the user with the Chemstation dataanalysis software (Menu/Tools/Export3-D...). By default, the generated file is named Export3d.CSV and is placed in the .D folder.   
	 
	You should then select the option DataType=\emph{Agilent}. 
				The function first checks that all sample folders (.D) within the folder \emph{path} contain both files, rteres.txt and Export3d.CSV. If one file is missing, the analysis stops and reports the name of the problematic sample. The analysis should be restarted after the sample has been corrected or removed. Next, the function collects the retention time (or retention index) of each peak from rteres.txt and looks for the corresponding mass spectra in Export3d.CSV. Depending on the apex option, either the mean mass spectrum of each peak is calculated or the mass spectrum at the apex is extracted. The intensity, in counts, of each mass fragment is transformed to a relative percentage of the highest mass fragment in the spectrum.  If quant = TRUE, the two quantification columns CorrArea (corrected peak area) and PercTot (percent of the total corrected area) are extracted for each peak from rteres.txt and placed in columns 3 and 4 of the output data matrix, respectively.  
				 
	(ii) For other providers: data should be transformed into the international ASCII format. All files (one per analysis) should be grouped in the folder \emph{path} and then passed through the trans.ASCII function. The first step is a smoothing of the chromatogram, controlled by the option N_filt (see the documentation of the function \emph{filter}, method=\emph{convolution}). Afterward, peaks are detected as a succession of three points of increasing intensity directly followed by three points of decreasing intensity (all points must have an intensity higher than 10 kilocounts). The first and last peaks of the chromatogram are removed if incomplete. Finally, depending on the apex option, the function either calculates the mean mass spectrum of each peak or extracts the mass spectrum at the apex, and the intensity (in counts) of each mass fragment is transformed to a relative percentage of the highest mass fragment in the spectrum.    
			 
	During the analysis, a temporary file called save_list_temp.rda is automatically generated in folder \emph{path}. 
		 
	The final output file called initial_DATA.txt is saved in folder \emph{Output_MSDataCreation_resultdate_time}.  
	The output data matrix contains the relative mass spectrum of each peak of all samples. The first column contains the sample name (the name of the folder containing the GC-MS analysis), the second column contains the peak retention time (or retention index) and the following columns contain the relative mass spectrum of the peak (within the range given by mz).  

If quant = TRUE for DataType = Agilent, the first column contains the sample name, the second column contains the peak retention time (or retention index), the third column contains the corrected area (CorrArea), the fourth column contains the percent of the total corrected area (PercTot) and the following columns contain the relative mass spectrum of the peak (within the range given by mz). 
}
\value{
MS.DataCreation returns a data matrix as an R object; this data matrix is also saved as initial_DATA.txt in the folder \emph{Output_MSDataCreation_resultdate_time}. It contains one row per peak and per sample, with the sample name, the retention time (or retention index) and the relative mass spectrum in columns. If quant = TRUE for DataType = Agilent, two supplementary columns, CorrArea and PercTot, are added after the retention time column. 
A temporary list is generated during the process. It allows temporary information to be recovered if the function stops before completion because of an error.

}

\author{
Elodie Courtois, Yann Guitton, Florence Nicole
}


\examples{
\dontrun{ 
##not run 
##For Agilent GC-MS files (rteres.txt and Export3d.CSV)
pathAgilent<-system.file("doc/Agilent_MSDataCreation",
package="MSeasy")
MS.DataCreation(path=pathAgilent,mz=30:250,DataType="Agilent",apex=FALSE) 

##For ASCII GC-MS files  
pathASCII<-system.file("doc/ASCII_MSDataCreation",
package="MSeasy")
MS.DataCreation(path=pathASCII,mz=30:250,DataType="ASCII",apex=TRUE, N_filt=3) 
  }
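
## Illustrative sketch only, NOT the internal code of MS.DataCreation.
## It mimics, on a small synthetic chromatogram, the three ASCII steps
## described in Details: smoothing, peak detection, relative spectrum.
## All object names (tic, spectra, smoothed, apex_idx, ...) are hypothetical.
## Assumptions: N_filt is used as the width of a centred moving average,
## and the apex is the last of the three rising points.
N_filt <- 3
threshold <- 10000  # 10 kilocounts
tic <- c(0, 0, 11, 14, 18, 25, 40, 28, 20, 15, 12, 0, 0) * 1000

## One synthetic mass spectrum per scan (rows = scans, columns = m/z 30 to 33)
spectra <- outer(tic, c(0.1, 0.5, 0.25, 0.15))
colnames(spectra) <- 30:33

## Step 1: smoothing with stats::filter, method = "convolution"
smoothed <- as.numeric(stats::filter(tic, rep(1 / N_filt, N_filt),
                                     method = "convolution", sides = 2))

## Step 2: three rising points directly followed by three falling points,
## all six above the threshold
n <- length(smoothed)
is_apex <- vapply(seq_len(n), function(i) {
  if (i < 3 || i > n - 3) return(FALSE)
  w <- smoothed[(i - 2):(i + 3)]
  !anyNA(w) && all(diff(w[1:3]) > 0) && all(diff(w[3:6]) < 0) &&
    all(w > threshold)
}, logical(1))
apex_idx <- which(is_apex)

## Step 3: spectrum at the apex (apex = TRUE) or mean of the scans before,
## at and after the apex (apex = FALSE), rescaled so that the highest
## mass fragment equals 100
spec_apex <- spectra[apex_idx[1], ]
spec_mean <- colMeans(spectra[(apex_idx[1] - 1):(apex_idx[1] + 1), ])
rel_apex <- 100 * spec_apex / max(spec_apex)
rel_mean <- 100 * spec_mean / max(spec_mean)
print(apex_idx)
print(round(rel_mean, 1))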
}


