Title: | Adaptive Processing of LC-MS Data |
Version: | 6.8.3 |
Date: | 2025-07-30 |
Description: | Provides methods for the processing of liquid chromatography-mass spectrometry (LC/MS) based metabolomics data, including adaptive tolerance level searching, non-parametric intensity grouping, the use of run filter to preserve weak signals, model-based estimation of peak intensities, and peak detection based on existing knowledge. Related references include Yu et al. (2009) <doi:10.1093/bioinformatics/btp291>, Liu et al. (2020) <doi:10.1038/s41598-020-70850-0>, Yu et al. (2014) <doi:10.1093/bioinformatics/btu430>, Yu et al. (2013) <doi:10.1021/pr301053d>. |
Depends: | R (≥ 2.10), foreach, iterators, ROCR, Rcpp, doParallel |
Imports: | rgl, mzR, e1071, gbm, randomForest, MASS, splines, ROCS |
Suggests: | msdata |
biocViews: | Technology, MassSpectrometry |
License: | GPL-2 | GPL-3 [expanded from: GPL (≥ 2)] |
LazyLoad: | yes |
NeedsCompilation: | yes |
Packaged: | 2025-07-31 00:56:58 UTC; 123 |
LinkingTo: | Rcpp |
Author: | Tianwei Yu [aut, cre] |
Maintainer: | Tianwei Yu <yutianwei@cuhk.edu.cn> |
Repository: | CRAN |
Date/Publication: | 2025-08-19 14:30:08 UTC |
Adaptive processing of LC/MS data
Description
The package generates a feature table from a batch of LC/MS spectra. It finds m/z and retention time tolerance levels from the data. A run filter is used to detect peaks and remove noise. Non-parametric statistical methods are used to fine-tune peak selection and grouping. After retention time correction, a feature table is generated by aligning peaks across spectra.
Author(s)
Tianwei Yu <tyu8@emory.edu> Maintainer: Tianwei Yu <tyu8@emory.edu>
References
Bioinformatics. 25(15):1930-36. BMC Bioinformatics. 11:559. J. Proteome Res. 12(3):1419-27.
Plot extracted ion chromatograms
Description
Given an output object from the function cdf.to.ftr(), this function plots the EICs selected by the user.
Usage
EIC.plot(aligned, rows = NA, colors = NA, transform = "none",
subset = NA, min.run, min.pres, max.spline.time.points
= 1000)
Arguments
aligned |
An output object from cdf.to.ftr(). |
rows |
A numeric vector selecting the rows of the aligned feature table to be plotted. |
colors |
The colors (one per profile) the user wishes to use for the plots. The default is NA, in which case a default color set is used. |
transform |
There are four possible values. "none": the original intensity data is plotted; "log": the intensity data is transformed by log(x+1); "sqrt": the intensity data is square root transformed; "cuberoot": the intensity data is cube root transformed. |
subset |
The user can choose a subset of the profiles for which the EICs are plotted. It is given as a vector of profile indices. The default is NA, in which case the EICs from all the profiles are plotted. |
min.run |
The min.run parameter used in the proc.cdf() step. |
min.pres |
The min.pres parameter used in the proc.cdf() step. |
max.spline.time.points |
The maximum time points to use in spline fit. |
Details
The EICs are plotted as overlaid line plots. The graphic device is divided into four parts, each of which is used to plot one EIC. When all four parts are occupied, the function calls x11() to open another graphic device. The colors used (one per profile) are printed in the command window.
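The four values of the transform argument can be sketched as a small helper. This is an illustration only: transform_intensity is a hypothetical name, not a function of the package, and the intensity values are made up.

```r
# Apply one of the four documented intensity transforms.
# `transform_intensity` is a hypothetical helper, not part of the package.
transform_intensity <- function(x, transform = c("none", "log", "sqrt", "cuberoot")) {
  transform <- match.arg(transform)
  switch(transform,
    none     = x,            # original intensities
    log      = log(x + 1),   # log(x+1), defined at zero intensity
    sqrt     = sqrt(x),      # square root
    cuberoot = x^(1 / 3)     # cube root
  )
}

intensity <- c(0, 7, 63, 999)   # made-up intensities
transform_intensity(intensity, "log")
transform_intensity(intensity, "cuberoot")
```

The log(x+1) form keeps zero intensities at zero instead of producing -Inf.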
Value
There is no return value.
Author(s)
Tianwei Yu <tyu8@emory.edu>
References
Bioinformatics. 25(15):1930-36. BMC Bioinformatics. 11:559.
Plot extracted ion chromatograms based on the machine learning method output
Description
Given an output object from the function semi.sup.learn(), this function plots the EICs selected by the user.
Usage
EIC.plot.learn(aligned, rows = NA, colors = NA, transform = "none",
subset = NA, tol = 2.5e-05, ridge.smoother.window =
50, baseline.correct = 0, max.spline.time.points =
1000)
Arguments
aligned |
An output object from semi.sup.learn(). |
rows |
A numeric vector selecting the rows of the aligned feature table to be plotted. |
colors |
The colors (one per profile) the user wishes to use for the plots. The default is NA, in which case a default color set is used. |
transform |
There are four possible values. "none": the original intensity data is plotted; "log": the intensity data is transformed by log(x+1); "sqrt": the intensity data is square root transformed; "cuberoot": the intensity data is cube root transformed. |
subset |
The user can choose a subset of the profiles for which the EICs are plotted. It is given as a vector of profile indices. The default is NA, in which case the EICs from all the profiles are plotted. |
tol |
The mz tolerance level used in learn.cdf(). |
ridge.smoother.window |
The ridge.smoother.window parameter value used in learn.cdf(). |
baseline.correct |
The baseline.correct parameter value used in learn.cdf(). |
max.spline.time.points |
The maximum number of points to use in the spline fit along the retention time axis. |
Details
The function plots a single EIC as intensity against retention time, using a different color for each profile.
Value
There is no return value.
Author(s)
Tianwei Yu <tyu8@emory.edu>
References
Bioinformatics. 25(15):1930-36. BMC Bioinformatics. 11:559.
Adaptive binning
Description
This is an internal function. It creates EICs using an adaptive binning procedure.
Usage
adaptive.bin(x, min.run, min.pres, tol, baseline.correct, weighted=FALSE)
Arguments
x |
A matrix with columns of m/z, retention time, intensity. |
min.pres |
Run filter parameter. The minimum proportion of presence in the time period for a series of signals grouped by m/z to be considered a peak. |
min.run |
Run filter parameter. The minimum length of elution time for a series of signals grouped by m/z to be considered a peak. |
tol |
m/z tolerance level for the grouping of data points. This value is expressed as the fraction of the m/z value. This value, multiplied by the m/z value, becomes the cutoff level. The recommended value is the machine's nominal accuracy level. Divide the ppm value by 1e6. For FTMS, 1e-5 is recommended. |
baseline.correct |
After grouping the observations, the highest intensity in each group is found. If the highest is lower than this value, the entire group will be deleted. The default value is NA, in which case the program uses the 75th percentile of the height of the noise groups. |
weighted |
Whether to weight the local density by signal intensities. |
Details
It uses repeated smoothing and splitting to separate EICs. The details are described in the reference and flowchart.
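The tol argument described above is a fraction of the m/z value, so an instrument accuracy quoted in ppm has to be divided by 1e6. A minimal sketch of the conversion and of the resulting absolute cutoff, using an illustrative 10 ppm accuracy and made-up m/z values:

```r
ppm <- 10                  # instrument accuracy in ppm (illustrative value)
tol <- ppm / 1e6           # fraction of m/z, the form expected by `tol`
mz  <- c(100, 500, 1000)   # example m/z values
cutoff <- tol * mz         # absolute m/z cutoff at each m/z value
data.frame(mz = mz, cutoff = cutoff)
```

The cutoff widens with m/z, which is why the tolerance is stated in relative terms.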
Value
A list is returned.
height.rec |
The records of the height of each EIC. |
masses |
The vector of m/z values after binning. |
labels |
The vector of retention time after binning. |
intensi |
The vector of intensity values after binning. |
grps |
The EIC labels, i.e. which EIC each observed data point belongs to. |
times |
All the unique retention time values, ordered. |
tol |
The m/z tolerance level. |
min.count.run |
The minimum number of elution time points for a series of signals grouped by m/z to be considered a peak. |
weighted |
Whether to weight the local density by signal intensities. |
Author(s)
Tianwei Yu <tyu8@emory.edu>
References
Bioinformatics. 25(15):1930-36. BMC Bioinformatics. 11:559.
Adaptive binning specifically for the machine learning approach.
Description
This is an internal function. It creates EICs using an adaptive binning procedure.
Usage
adaptive.bin.2(x, tol, ridge.smoother.window=50, baseline.correct)
Arguments
x |
A matrix with columns of m/z, retention time, intensity. |
tol |
m/z tolerance level for the grouping of data points. This value is expressed as the fraction of the m/z value. This value, multiplied by the m/z value, becomes the cutoff level. The recommended value is the machine's nominal accuracy level. Divide the ppm value by 1e6. For FTMS, 1e-5 is recommended. |
ridge.smoother.window |
The size of the smoother window used by the kernel smoother to remove long ridge noise from the EIC. |
baseline.correct |
After grouping the observations, the highest intensity in each group is found. If the highest is lower than this value, the entire group will be deleted. The default value is NA, in which case the program uses the 75th percentile of the height of the noise groups. |
Details
It uses repeated smoothing and splitting to separate EICs. The details are described in the reference and flowchart.
Value
A list is returned.
height.rec |
The records of the height of each EIC. |
masses |
The vector of m/z values after binning. |
labels |
The vector of retention time after binning. |
intensi |
The vector of intensity values after binning. |
grps |
The EIC labels, i.e. which EIC each observed data point belongs to. |
times |
All the unique retention time values, ordered. |
tol |
The m/z tolerance level. |
Author(s)
Tianwei Yu <tyu8@emory.edu>
References
Bioinformatics. 30(20): 2941-2948. Bioinformatics. 25(15):1930-36. BMC Bioinformatics. 11:559.
A table of potential adducts.
Description
The data is based on the Metabolomics Fiehn Lab's Mass Spectrometry Adduct Calculator. It provides the basis for calculating the m/z of the ion forms of known metabolites.
Usage
data(adduct.table)
Format
A data frame with 47 observations on the following 4 variables.
- adduct
The ion form.
- divider
The value to divide the neutral mass by.
- addition
The value to add after dividing.
- charge
The charge state of the ion form.
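Given these columns, the m/z of an ion form is the neutral mass divided by divider, plus addition. The sketch below builds two illustrative rows in the same format; the row values are assumptions written out for the example, not copied from adduct.table, and the neutral mass is the monoisotopic mass of glucose.

```r
# Two illustrative rows in the documented format (values are assumptions,
# not read from the packaged adduct.table).
adducts <- data.frame(
  adduct   = c("M+H", "M+2H"),
  divider  = c(1, 2),
  addition = c(1.007276, 1.007276),
  charge   = c(1, 2)
)

neutral_mass <- 180.06339   # glucose, C6H12O6 (monoisotopic mass)

# m/z of each ion form: divide the neutral mass, then add the adduct term.
adducts$mz <- neutral_mass / adducts$divider + adducts$addition
adducts
```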
Source
http://fiehnlab.ucdavis.edu/staff/kind/Metabolomics/MS-Adduct-Calculator/
References
Huang N.; Siegel M.M.; Kruppa G.H.; Laukien F.H. Automation of a Fourier transform ion cyclotron resonance mass spectrometer for acquisition, analysis, and e-mailing of high-resolution exact-mass electrospray ionization mass spectral data. J Am Soc Mass Spectrom 1999, 10, 1166-1173.
Examples
data(metabolite.table)
data(adduct.table)
known.table.example<-make.known.table(metabolite.table[1001:1020,], adduct.table[1:4,])
Adjust retention time across spectra.
Description
This function adjusts the retention time in each LC/MS profile to achieve better between-profile agreement.
Usage
adjust.time(features, mz.tol = NA, chr.tol = NA, colors=NA, find.tol.max.d=1e-4,
max.align.mz.diff=0.01, transform.mz=FALSE, transform.mz.const=0.1)
Arguments
features |
A list object. Each component is a matrix which is the output from prof.to.features(). |
mz.tol |
The m/z tolerance level for peak alignment. The default is NA, which allows the program to search for the tolerance level based on the data. This value is expressed as a fraction of the m/z value. This value, multiplied by the m/z value, becomes the cutoff level. |
chr.tol |
The retention time tolerance level for peak alignment. The default is NA, which allows the program to search for the tolerance level based on the data. |
colors |
The vector of colors to be used for the line plots of time adjustments. The default is NA, in which case the program uses a default color set. |
find.tol.max.d |
Argument passed to find.tol(). Consider only m/z diffs smaller than this value. This is only used when the mz.tol is NA. |
max.align.mz.diff |
As the m/z tolerance is expressed in relative terms (ppm), it may not be suitable when the m/z range is wide. This parameter limits the tolerance in absolute terms. It mostly influences feature matching in higher m/z range. |
transform.mz |
Whether to apply a nonlinear transformation to m/z values before alignment. |
transform.mz.const |
A constant used in the m/z transformation function. |
Details
The function first searches for the m/z tolerance level using a mixture model. After the mz.tol is obtained, the peaks are grouped based on it. The function then searches for the retention time tolerance level. Because the peaks are grouped using m/z, only metabolites that share m/z require this parameter. A rather lenient retention time tolerance level is found using a mixture model.
The profile with the highest number of peaks is selected as the template, and every other spectrum is adjusted to it one at a time. At every m/z value, if each of the two spectra has just one peak, and the peaks are within the retention time tolerance range, the pair of retention time values is used in the curve fitting. A kernel smoother is fitted using the difference in retention time against the retention time in the profile to be adjusted.
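The curve-fitting step can be sketched with stats::ksmooth on synthetic matched pairs. This is an illustration under stated assumptions, not the package's implementation: the data, the drift model, and the bandwidth of 100 are all made up for the example.

```r
set.seed(1)
# Hypothetical matched peak pairs: retention times in the template and in
# the profile to be adjusted, which drifts smoothly relative to the template.
rt_template <- sort(runif(60, 50, 950))
drift       <- 5 + 0.01 * rt_template
rt_profile  <- rt_template + drift + rnorm(60, sd = 0.5)

# Smooth the difference (template - profile) against the profile time.
fit <- ksmooth(rt_profile, rt_template - rt_profile,
               kernel = "normal", bandwidth = 100,
               x.points = rt_profile)

# ksmooth returns its x values sorted; map corrections back to input order.
correction  <- fit$y[match(rt_profile, fit$x)]
rt_adjusted <- rt_profile + correction

err_before <- mean(abs(rt_profile  - rt_template))
err_after  <- mean(abs(rt_adjusted - rt_template))
c(before = err_before, after = err_after)
```

The smoothed difference is added to the profile's own retention times, so peaks without a matched partner can be corrected by evaluating the same curve at their retention times.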
Value
A list object with the exact same structure as the input object features, i.e. one matrix per profile being processed. The only difference this output object has with the input object is that the retention time column in each of the matrices is changed to new adjusted values.
Author(s)
Tianwei Yu <tyu8@emory.edu>
See Also
feature.align
Examples
data(features)
adjusted<-adjust.time(features, colors=c("red","blue","green","cyan"))
Convert a number of cdf files in the same directory to a feature table
Description
This is a wrapper function, which calls five other functions to convert a number of cdf files to a feature table. All cdf files to be processed must be in a single folder.
Usage
cdf.to.ftr(folder, output_path, file.pattern=".cdf", n.nodes=4, min.exp=2,
min.pres=0.5, min.run=12, mz.tol=1e-5, baseline.correct.noise.percentile=0.05,
shape.model="bi-Gaussian", BIC.factor=2, baseline.correct=0, peak.estim.method="moment",
min.bw=NA, max.bw=NA, sd.cut=c(0.01,500), sigma.ratio.lim=c(0.01, 100),
component.eliminate=0.01, moment.power=1, subs=NULL, align.mz.tol=NA, align.chr.tol=NA,
max.align.mz.diff=0.01, pre.process=FALSE, recover.mz.range=NA, recover.chr.range=NA,
use.observed.range=TRUE, recover.min.count=3, intensity.weighted=FALSE)
Arguments
folder |
The folder where all CDF files to be processed are located. For example "C:/CDF/this_experiment". |
output_path |
Path to the output directory |
file.pattern |
The pattern in the names of the files to be processed. The default is ".cdf". Other formats supported by mzR package can also be used, e.g. "mzML" etc. |
n.nodes |
The number of CPU cores to be used through doSNOW. |
min.exp |
If a feature is to be included in the final feature table, it must be present in at least this number of spectra. |
min.pres |
This is a parameter of the run filter, to be passed to the function proc.cdf(). Please see the help for proc.cdf() for details. |
min.run |
This is a parameter of the run filter, to be passed to the function proc.cdf(). Please see the help for proc.cdf() for details. |
subs |
If not all the CDF files in the folder are to be processed, the user can define a subset using this parameter. For example, subs=15:30, or subs=c(2,4,6,8) |
mz.tol |
The user can provide the m/z tolerance level for peak identification. This value is expressed as a fraction of the m/z value. This value, multiplied by the m/z value, becomes the cutoff level. Please see the help for proc.cdf() for details. |
baseline.correct.noise.percentile |
The percentile of signal strength of those EICs that don't pass the run filter, to be used as the baseline threshold of signal strength. This parameter is passed to proc.cdf(). |
shape.model |
The mathematical model for the shape of a peak. There are two choices - bi-Gaussian and Gaussian. When the peaks are asymmetric, the bi-Gaussian is better. The default is bi-Gaussian. |
BIC.factor |
The factor by which the number of parameters is multiplied to modify the BIC criterion. If larger than 1, models with more peaks are penalized more. |
baseline.correct |
This is a parameter in peak detection. After grouping the observations, the highest observation in each group is found. If the highest is lower than this value, the entire group will be deleted. The default value is NA, which allows the program to search for the cutoff level. Please see the help for proc.cdf() for details. |
peak.estim.method |
The bi-Gaussian peak parameter estimation method, to be passed to the subroutine prof.to.features(). Two possible values: moment and EM. |
min.bw |
The minimum bandwidth in the smoother in prof.to.features(). Please see the help file for prof.to.features() for details. |
max.bw |
The maximum bandwidth in the smoother in prof.to.features(). Please see the help file for prof.to.features() for details. |
sd.cut |
A parameter for the prof.to.features() function. A vector of two. Features with standard deviation outside the range defined by the two numbers are eliminated. |
sigma.ratio.lim |
A parameter for the prof.to.features() function. A vector of two. It enforces the assumed range of the ratio between the left standard deviation and the right standard deviation of the bi-Gaussian function used to fit the data. |
component.eliminate |
In fitting mixture of bi-Gaussian (or Gaussian) model of an EIC, when a component accounts for a proportion of intensities less than this value, the component will be ignored. |
moment.power |
The power parameter for data transformation when fitting the bi-Gaussian or Gaussian mixture model in an EIC. |
align.chr.tol |
The user can provide the elution time tolerance level to override the program's selection. This value is in the same unit as the elution time, normally seconds. Please see the help for adjust.time() for details. |
align.mz.tol |
The user can provide the m/z tolerance level for peak alignment to override the program's selection. This value is expressed as a fraction of the m/z value. This value, multiplied by the m/z value, becomes the cutoff level. Please see the help for feature.align() for details. |
max.align.mz.diff |
As the m/z tolerance in alignment is expressed in relative terms (ppm), it may not be suitable when the m/z range is wide. This parameter limits the tolerance in absolute terms. It mostly influences feature matching in higher m/z range. |
pre.process |
Logical. If TRUE, the program will not perform time correction and alignment. It will only generate a peak table for each spectrum and save the files. This allows the task to be divided manually among multiple machines. |
recover.mz.range |
A parameter of the recover.weaker() function. The m/z around the feature m/z to search for observations. The default value is NA, in which case 1.5 times the m/z tolerance in the aligned object will be used. |
recover.chr.range |
A parameter of the recover.weaker() function. The retention time around the feature retention time to search for observations. The default value is NA, in which case 0.5 times the retention time tolerance in the aligned object will be used. |
use.observed.range |
A parameter of the recover.weaker() function. If the value is TRUE, the actual range of the observed locations of the feature in all the spectra will be used. |
recover.min.count |
The minimum time point count for a series of points in the EIC for it to be considered a true feature. |
intensity.weighted |
Whether to weight the local density by signal intensities in the initial peak detection. |
Details
The wrapper function calls five other functions to perform the feature table generation. Every spectrum (cdf file) first goes through proc.cdf() and prof.to.features() to generate a spectrum-level peak table. The elution time correction is done by adjust.time(). Then the peaks are aligned across spectra by feature.align(). For features detected in a portion of the spectra, weaker signals in other spectra are recovered by recover.weaker(). From version 4, the parameter mz.tol can no longer be NA. This allows the program to better process data other than FTLCMS. It is recommended that the user use the machine's claimed accuracy. For FTMS, 1e-5 is recommended.
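A minimal sketch of a call to the wrapper, using only arguments listed in the Usage section. The folder names are placeholders and the parameter values are illustrative; the call is guarded so that it runs only when the apLCMS package is installed and the placeholder folder exists.

```r
# Arguments follow the Usage section; folder names are placeholders.
args <- list(
  folder       = "C:/CDF/this_experiment",
  output_path  = "C:/CDF/this_experiment/output",
  file.pattern = ".cdf",
  n.nodes      = 2,
  min.exp      = 2,
  min.pres     = 0.5,
  min.run      = 12,
  mz.tol       = 10 / 1e6   # 10 ppm expressed as a fraction; must not be NA
)

# Run only when the package and the data folder are both available.
if (requireNamespace("apLCMS", quietly = TRUE) && dir.exists(args$folder)) {
  result <- do.call(apLCMS::cdf.to.ftr, args)
  head(result$final.ftrs)   # feature table after weak signal recovery
}
```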
Value
A list is returned.
features |
A list object, each component of which is the peak table from a single spectrum. |
features2 |
A list object, each component of which is the peak table from a single spectrum, after elution time correction. |
aligned.ftrs |
Feature table BEFORE weak signal recovery. |
final.ftrs |
Feature table after weak signal recovery. This is the end product of the function. |
pk.times |
Table of feature elution time BEFORE weak signal recovery. |
final.times |
Table of feature elution time after weak signal recovery. |
mz.tol |
The input mz.tol value by the user. |
align.mz.tol |
The m/z tolerance level in the alignment across spectra, either input from the user or automatically selected when the user input is NA. |
align.chr.tol |
The retention time tolerance level in the alignment across spectra, either input from the user or automatically selected when the user input is NA. |
Author(s)
Tianwei Yu <tyu8@sph.emory.edu>
See Also
proc.cdf, prof.to.feature, adjust.time, feature.align, recover.weaker
Continuity index
Description
This is an internal function. It uses continuity index (or "run filter") to select putative peaks from EIC.
Usage
cont.index(newprof, min.pres = 0.6, min.run = 5)
Arguments
newprof |
The matrix containing m/z, retention time, intensity, and EIC label as columns. |
min.pres |
Run filter parameter. The minimum proportion of presence in the time period for a series of signals grouped by m/z to be considered a peak. |
min.run |
Run filter parameter. The minimum length of elution time for a series of signals grouped by m/z to be considered a peak. |
Details
This is the run filter described in Yu et al. (2009), Bioinformatics.
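The idea of the run filter can be sketched as follows. This is a simplified illustration, not the package's implementation: it assumes evenly spaced scans, expresses the run length as a number of consecutive scans rather than elution time, and uses a hypothetical helper name, passes_run_filter.

```r
# `present` marks, scan by scan, whether a signal was observed in one m/z group.
# A group passes if some window of `run_scans` consecutive scans has a
# proportion of observed signals of at least `min.pres`.
passes_run_filter <- function(present, run_scans, min.pres) {
  if (length(present) < run_scans) return(FALSE)
  csum <- cumsum(c(0, present))
  n    <- length(csum)
  # Proportion of scans with a signal in every window of `run_scans` scans.
  frac <- (csum[(run_scans + 1):n] - csum[1:(n - run_scans)]) / run_scans
  any(frac >= min.pres)
}

peak  <- c(0, 0, 1, 1, 0, 1, 1, 1, 0, 0)   # dense run with one missing scan
noise <- c(1, 0, 0, 0, 1, 0, 0, 0, 0, 1)   # scattered, isolated signals

passes_run_filter(peak,  run_scans = 5, min.pres = 0.6)
passes_run_filter(noise, run_scans = 5, min.pres = 0.6)
```

Allowing a proportion below 1 is what lets a weak peak with a few missing scans survive while scattered noise is removed.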
Value
A list is returned.
new.rec |
The matrix containing m/z, retention time, intensity, and EIC label as columns after applying the run filter. |
height.rec |
The vector of peak heights. |
time.range.rec |
The vector of peak retention time span. |
mz/pres.rec |
The vector of proportion of non-missing m/z. |
Author(s)
Tianwei Yu <tyu8@emory.edu>
References
Bioinformatics. 25(15):1930-36. BMC Bioinformatics. 11:559.
Internal function: Extract data feature from EIC.
Description
The function extracts data features after applying different smoother settings.
Usage
eic.disect(raw.prof, smoother.window = c(1, 5, 10))
Arguments
raw.prof |
The data after adaptive binning, i.e. the output from adaptive.bin.2(). |
smoother.window |
The smoother window sizes to use for data feature extraction. |
Details
We take a number of data characteristic measurements from each EIC, including m/z span, m/z standard deviation, retention time (RT) span, RT peak location, and summary statistics on the raw intensity values of the EIC. We also centroid the data in each EIC such that it becomes two-dimensional data (intensity vs. RT). We then apply different smoothers (shape/window size) in combination with different weighting schemes (unweighted, weighted with intensity, weighted with log intensity) to each EIC. At each smoothing setting, we record summary statistics of the smoothed data.
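Recording summary statistics at several smoother window sizes can be sketched with stats::ksmooth. The EIC below is synthetic, the choice of statistics (maximum, standard deviation, peak location) is an assumption made for the example, and weighting schemes are left out.

```r
set.seed(2)
# Synthetic centroided EIC: one Gaussian-shaped peak plus small positive noise.
rt        <- seq(0, 60, by = 0.5)
intensity <- 1000 * dnorm(rt, mean = 30, sd = 3) + abs(rnorm(length(rt), sd = 2))

windows <- c(1, 5, 10)   # same values as the smoother.window default
stats <- sapply(windows, function(w) {
  # Unweighted normal-kernel smoother evaluated on the original time grid.
  sm <- ksmooth(rt, intensity, kernel = "normal", bandwidth = w, x.points = rt)$y
  c(max = max(sm), sd = sd(sm), peak_rt = rt[which.max(sm)])
})
colnames(stats) <- paste0("window_", windows)
stats
```

Each column becomes a group of data features for the EIC; wider windows flatten the peak, so the way the statistics change across windows carries information about peak shape.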
Value
A matrix. Every row corresponds to an EIC. Every column corresponds to a data feature.
Author(s)
Tianwei Yu <tyu8@emory.edu>
References
Bioinformatics. 30(20): 2941-2948.
Internal function: calculate the score for each EIC based on prediction of match status.
Description
This function uses predictive models to evaluate the data features and gives a score to every EIC. The scores serve as the basis for EIC selection.
Usage
eic.pred(eic.rec, known.mz, mass.matched = NA, to.use = 10, do.plot = FALSE,
match.tol.ppm = 5, do.grp.reduce = TRUE, remove.bottom = 5, max.fpr = 0.3, min.tpr = 0.8)
Arguments
eic.rec |
The matrix of data features from every EIC. Each row is an EIC. Each column is a data feature value. |
known.mz |
The m/z values of the known metabolic features. |
mass.matched |
An indicator vector. "1" means the corresponding EIC has an m/z matched to known features. The default is NA, in which case the matching is done inside this function. |
to.use |
The maximum number of data features to use in the predictive models. |
do.plot |
Whether diagnostic plots would be generated. |
match.tol.ppm |
The tolerance level in the m/z match, at ppm scale. |
do.grp.reduce |
Whether to reduce the data features first by reducing each group of similar features into one. |
remove.bottom |
The number of worst performing data features to remove before model building. The removal is based on single-predictor ROC analysis. |
max.fpr |
The threshold for selecting unmatched EICs. Each EIC is assigned an FPR value based on the final prediction model. Those with FPR smaller than this threshold will be selected. If a vector is provided, the first one will be used. But all FPR values will also be returned. So other functions will be able to make selections based on other threshold values. |
min.tpr |
The threshold for selecting matched EICs. Each EIC is assigned a TPR value based on the final prediction model. Those with TPR larger than this threshold will be selected. If a vector is provided, the first one will be used. But all TPR values will also be returned. So other functions will be able to make selections based on other threshold values. |
Details
The function first subsamples the EICs to balance the unmatched and matched groups. Then it randomly splits the data into training and testing sets. Combinations of feature ranking methods and predictive models are used, and their performance is gauged using the testing set. The overall best model is selected, and each EIC receives a score based on this model.
Although there is a single scoring system for all EICs, those matched are treated differently than unmatched, because we have higher confidence in them being real metabolites. The matched are selected using the "min.tpr" threshold, to ensure the majority of them enter next step. Those unmatched are selected using the "max.fpr" threshold.
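The subsample, split, fit, and evaluate sequence can be sketched in base R. This is an illustration under stated assumptions: the data features f1 and f2 are synthetic, and a logistic regression stands in for the predictive models the function actually compares.

```r
set.seed(3)
n_matched <- 80; n_unmatched <- 400
eic <- data.frame(
  matched = rep(c(1, 0), c(n_matched, n_unmatched)),
  f1 = c(rnorm(n_matched, mean = 2), rnorm(n_unmatched, mean = 0)),  # informative
  f2 = rnorm(n_matched + n_unmatched)                                # pure noise
)

# 1. Subsample the larger class so matched and unmatched are balanced.
keep <- c(which(eic$matched == 1),
          sample(which(eic$matched == 0), n_matched))
balanced <- eic[keep, ]

# 2. Random split into training and testing sets.
train_idx <- sample(nrow(balanced), size = floor(0.7 * nrow(balanced)))
train <- balanced[train_idx, ]
test  <- balanced[-train_idx, ]

# 3. Fit a model on the training set (logistic regression as a stand-in).
fit   <- glm(matched ~ f1 + f2, data = train, family = binomial())
score <- predict(fit, newdata = test, type = "response")

# 4. Gauge performance on the testing set with the rank-based AUC.
auc <- function(score, label) {
  r  <- rank(score)
  n1 <- sum(label == 1); n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
test_auc <- auc(score, test$matched)
test_auc
```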
Value
A list is returned.
chosen |
An indicator vector. "1" means the EIC is selected; "0" means unselected. When multiple min.tpr and/or max.fpr are provided, this vector corresponds to the combination of the first min.tpr and max.fpr. |
fpr |
The vector of FPR values, each value corresponds to the FPR at the cutoff of the specific EIC. |
tpr |
The vector of TPR values, each value corresponds to the TPR at the cutoff of the specific EIC. |
matched |
An indicator vector. "1" means matched to known features. "0" means unmatched. |
pred.performance |
Prediction performance of all models tested. |
feature.rank.method |
Which method is used for ranking features. |
model |
Which prediction model is used. |
feature importance |
The importance score of all data features generated by the feature ranking method. |
used.features |
The names of the features used in the final model. |
final.auc |
The AUC of the selected model. |
Author(s)
Tianwei Yu <tianwei.yu@emory.edu>
References
Bioinformatics. 30(20): 2941-2948.
See Also
semi.sup.learn, eic.qual, eic.disect
Internal function: Calculate the single predictor quality.
Description
For each column of an EIC data feature matrix, find its predictive power on the m/z match to known metabolites.
Usage
eic.qual(eic.rec, known.mz, mass.matched = NA, match.tol.ppm = 5, do.plot = FALSE,
pos.confidence = 0.99, neg.confidence = 0.99)
Arguments
eic.rec |
The EIC data feature matrix. Each row is an EIC. Each column is a data feature. |
known.mz |
The known m/z values to be matched to. |
mass.matched |
A vector of indicators of whether the m/z of each EIC is matched to the known m/z values. The default is NA, in which case it is calculated within the function. |
match.tol.ppm |
The tolerance level of m/z match. |
do.plot |
Whether to produce plots of the ROCS. |
pos.confidence |
The confidence level for the features matched to the known feature list. |
neg.confidence |
The confidence level for the features not matching to the known feature list. |
Value
A matrix of four columns. The first two columns are the VUS and AUC without uncertainty. The next two columns are the VUS and AUC with uncertainty.
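The plain AUC part of this output can be sketched per column with the rank-based formula. The feature matrix below is synthetic, and the VUS and the uncertainty-aware versions are not reproduced here.

```r
set.seed(4)
# Synthetic example: 100 matched and 100 unmatched EICs, two data features.
mass.matched <- rep(c(1, 0), each = 100)
eic.rec <- cbind(
  informative = rnorm(200, mean = 1.5 * mass.matched),  # shifted for matched EICs
  noise       = rnorm(200)                              # unrelated to match status
)

# Rank-based AUC of a single predictor against the match indicator.
auc <- function(x, label) {
  r  <- rank(x)
  n1 <- sum(label == 1); n0 <- sum(label == 0)
  (sum(r[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

col_auc <- apply(eic.rec, 2, auc, label = mass.matched)
col_auc
```

A column with AUC near 0.5 carries little information about match status and is a candidate for removal before model building.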
Author(s)
Tianwei Yu <tyu8@emory.edu>
References
Bioinformatics. 30(20): 2941-2948.
Align peaks from spectra into a feature table.
Description
Identifies which of the peaks from the profiles correspond to the same feature.
Usage
feature.align(features, min.exp = 2, mz.tol = NA, chr.tol = NA, find.tol.max.d=1e-4,
max.align.mz.diff=0.01)
Arguments
features |
A list object. Each component is a matrix which is the output from prof.to.features(). |
min.exp |
A feature has to show up in at least this number of profiles to be included in the final result. |
mz.tol |
The m/z tolerance level for peak alignment. The default is NA, which allows the program to search for the tolerance level based on the data. This value is expressed as a fraction of the m/z value. This value, multiplied by the m/z value, becomes the cutoff level. |
chr.tol |
The retention time tolerance level for peak alignment. The default is NA, which allows the program to search for the tolerance level based on the data. |
find.tol.max.d |
Argument passed to find.tol(). Consider only m/z diffs smaller than this value. This is only used when the mz.tol is NA. |
max.align.mz.diff |
As the m/z tolerance is expressed in relative terms (ppm), it may not be suitable when the m/z range is wide. This parameter limits the tolerance in absolute terms. It mostly influences feature matching in higher m/z range. |
Details
The function first searches for the m/z tolerance level using a mixture model. After the mz.tol is obtained, the peaks are grouped based on it. Consecutive peaks with m/z value difference smaller than the tolerance level are considered to belong to the same peak group. Non-parametric density estimation within each peak group is used to further split peak groups. The function then searches for the retention time tolerance level. Because the peaks are grouped using m/z, only metabolites that share m/z require this parameter. A rather lenient retention time tolerance level is found using a mixture model. After splitting the peak groups by this value, non-parametric density estimation is used to further split peak groups. Peaks belonging to one group are considered to correspond to the same feature.
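The first grouping step, in which consecutive peaks closer than the tolerance fall into one group, can be sketched as below. This is a simplified illustration: group_by_mz is a hypothetical helper, the m/z values are made up, and the later density-based splitting and the retention time step are not shown.

```r
group_by_mz <- function(mz, mz.tol, max.align.mz.diff = 0.01) {
  ord       <- order(mz)
  mz_sorted <- mz[ord]
  # Relative tolerance converted to absolute terms, capped by max.align.mz.diff.
  abs_tol   <- pmin(mz.tol * mz_sorted, max.align.mz.diff)
  # A new group starts wherever the gap to the previous peak exceeds the tolerance.
  new_group <- c(TRUE, diff(mz_sorted) > abs_tol[-1])
  grp       <- integer(length(mz))
  grp[ord]  <- cumsum(new_group)
  grp
}

mz <- c(200.0000, 200.0010, 200.0500, 1500.000, 1500.012)
group_by_mz(mz, mz.tol = 1e-5)                            # cap of 0.01 applies
group_by_mz(mz, mz.tol = 1e-5, max.align.mz.diff = Inf)   # no cap
```

At m/z 1500 the relative tolerance of 1e-5 corresponds to 0.015, so the last two peaks merge without the cap and separate with it, which is the effect of max.align.mz.diff at higher m/z.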
Value
Returns a list object with the following objects in it:
aligned.ftrs |
A matrix, with columns of m/z values, elution times, signal strengths in each spectrum. |
pk.times |
A matrix, with columns of m/z, median elution time, and elution times in each spectrum. |
mz.tol |
The m/z tolerance used in the alignment. |
chr.tol |
The elution time tolerance in the alignment. |
Author(s)
Tianwei Yu <tyu8@emory.edu>
See Also
prof.to.features
Examples
data(features)
features.2<-adjust.time(features)
this.aligned<-feature.align(features.2,min.exp=2)
summary(this.aligned)
this.aligned$aligned.ftrs[1:5,]
this.aligned$pk.times[1:5,]
Sample feature tables from 4 profiles
Description
A list object containing 4 matrices, each of which is the feature table from a profile.
Usage
data(features)
Format
List object containing multiple matrices. One matrix from each spectrum.
Source
Data from Dean Jones lab, Emory University School of Medicine.
Examples
data(features)
Internal function: finding the best match between a set of detected features and a set of known features.
Description
Given a small matrix of distances, find the best column-row pairing that minimizes the sum of distances of the matched pairs.
Usage
find.match(a, unacceptable = 4)
Arguments
a |
A matrix of distances. |
unacceptable |
The distance above which a pair cannot be accepted as a match. |
Value
A matrix of the same dimensions as the input matrix, with matched positions taking the value 1 and all other positions taking the value 0.
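For a small distance matrix the pairing can be found by brute force over all column orderings. The sketch below is an illustration, not the package's find.match(): perms and best_pairing are hypothetical helpers, the input must have no more rows than columns, and an unacceptable pair is assumed to cost the same as leaving the row unmatched.

```r
# All orderings of a small vector (recursive; only practical for small inputs).
perms <- function(v) {
  if (length(v) <= 1) return(list(v))
  out <- list()
  for (i in seq_along(v)) {
    for (p in perms(v[-i])) out[[length(out) + 1]] <- c(v[i], p)
  }
  out
}

# `best_pairing` is a hypothetical stand-in, not the package's find.match().
best_pairing <- function(a, unacceptable = 4) {
  stopifnot(is.matrix(a), nrow(a) <= ncol(a))
  rows <- seq_len(nrow(a))
  best_cost <- Inf
  best_idx  <- NULL
  for (p in perms(seq_len(ncol(a)))) {
    idx  <- cbind(rows, p[rows])              # row i paired with column p[i]
    # Assumption: an unacceptable pair costs as much as an unmatched row.
    cost <- sum(pmin(a[idx], unacceptable))
    if (cost < best_cost) {
      best_cost <- cost
      best_idx  <- idx
    }
  }
  out  <- matrix(0, nrow(a), ncol(a))
  keep <- a[best_idx] <= unacceptable         # drop pairs beyond the threshold
  out[best_idx[keep, , drop = FALSE]] <- 1
  out
}

a <- matrix(c(1, 5, 2, 1), nrow = 2)   # rows: detected features, columns: known features
best_pairing(a)
```

The number of orderings grows factorially with the number of columns, so this exhaustive approach is suitable only for the small matrices the function is described as handling.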
Author(s)
Tianwei Yu <tyu8@emory.edu>
An internal function that is not supposed to be directly accessed by the user. Find m/z tolerance level.
Description
The function finds the tolerance level in m/z from a given vector of observed m/z values.
Usage
find.tol(a, uppermost=1e-4, aver.bin.size=4000, min.bins=50, max.bins=200)
Arguments
a |
The vector of observed m/z values. |
uppermost |
Consider only m/z diffs smaller than this value. |
aver.bin.size |
The average bin size to determine the number of equally spaced points in the kernel density estimation. |
min.bins |
The minimum number of bins to use in the kernel density estimation. It overrides aver.bin.size when too few observations are present. |
max.bins |
The maximum number of bins to use in the kernel density estimation. It overrides aver.bin.size when too many observations are present. |
Details
The method assumes a mixture model: an unknown distribution of m/z variations in the same peak, and an exponential distribution of between-peak diffs. The parameter of the exponential distribution is estimated by the upper 75
Value
The tolerance level is returned.
Author(s)
Tianwei Yu <tyu8@emory.edu>
Examples
data(prof)
find.tol(prof[[1]][,1])
An internal function that is not supposed to be directly accessed by the user. Find elution time tolerance level.
Description
This function finds the time tolerance level. Also, it returns the grouping information given the time tolerance.
Usage
find.tol.time(mz, chr, lab, num.exp, mz.tol = 2e-05, chr.tol = NA,
aver.bin.size = 200, min.bins = 50, max.bins = 100,
max.mz.diff = 0.01, max.num.segments = 10000)
Arguments
mz |
The m/z values of all peaks in all profiles in the study. |
chr |
The retention times of all peaks in all profiles in the study. |
lab |
The labels of all peaks in all profiles in the study. |
num.exp |
The number of spectra in this analysis. |
mz.tol |
The m/z tolerance level for the grouping of signals into peaks. This value is expressed as a fraction of the m/z value. This value, multiplied by the m/z value, becomes the cutoff level. |
chr.tol |
The elution time tolerance. If NA, the function finds the tolerance level first. If a numerical value is given, the function goes directly to the second step: grouping peaks based on the tolerance. |
aver.bin.size |
The average bin size to determine the number of equally spaced points in the kernel density estimation. |
min.bins |
The minimum number of bins to use in the kernel density estimation. It overrides aver.bin.size when too few observations are present. |
max.mz.diff |
As the m/z tolerance in alignment is expressed in relative terms (ppm), it may not be suitable when the m/z range is wide. This parameter limits the tolerance in absolute terms. It mostly influences feature matching in higher m/z range. |
max.bins |
The maximum number of bins to use in the kernel density estimation. It overrides aver.bin.size when too many observations are present. |
max.num.segments |
The maximum number of segments. |
Details
The peaks are first ordered by m/z, and split into groups by the m/z tolerance. Then within every peak group, the pairwise elution time difference is calculated. All the pairwise elution time differences within groups are merged into a single vector. A mixture model (unknown distribution for distance between peaks from the same feature, and a triangle-shaped distribution for distance between peaks from different features) is fit to find the elution time tolerance level. The elution times within each peak group are then ordered. If a gap between consecutive retention times is larger than the elution time tolerance level, the group is further split at the gap. Grouping information is returned, as well as the elution time tolerance level.
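The final splitting step described above (ordering the elution times within a peak group and cutting at gaps larger than the tolerance) can be sketched as follows. The function name split_by_gap() is hypothetical and the tolerance is supplied by hand, so this does not reproduce the mixture-model estimation of the tolerance itself.

```r
# Hypothetical sketch: assign a subgroup label to each elution time so that
# consecutive (sorted) times further apart than chr.tol fall in new subgroups.
split_by_gap <- function(times, chr.tol) {
  if (length(times) == 0) return(integer(0))
  ord <- order(times)
  gaps <- diff(times[ord])
  grp.sorted <- cumsum(c(TRUE, gaps > chr.tol))  # new label after each large gap
  grp <- integer(length(times))
  grp[ord] <- grp.sorted                         # back to the input order
  grp
}

rt <- c(101, 250, 103, 99, 255, 400)
grp <- split_by_gap(rt, chr.tol = 10)
```

With a tolerance of 10, the times near 100 share one label, the times near 250 share a second, and the isolated time at 400 receives a third.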
Value
A list object is returned:
chr.tol |
The elution time tolerance level. |
comp2 |
A matrix with six columns. Every row corresponds to a peak in one of the spectra. The columns are: m/z, elution time, spread, signal strength, spectrum label, and peak group label. The rows are ordered by the median m/z of each peak group, and within each peak group the rows are ordered by the elution time. |
Author(s)
Tianwei Yu <tyu8@emory.edu>
Find peaks and valleys of a curve.
Description
This is an internal function which is not supposed to be directly accessed by the user. Finds the peaks and valleys of a smooth curve.
Usage
find.turn.point(y)
Arguments
y |
The y values of a curve in x-y plane. |
Value
A list object:
pks |
The peak positions. |
vlys |
The valley positions |
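One common way to locate the turning points of a smooth curve is to look for sign changes in its first differences. The sketch below uses that approach under stated assumptions: turn_points() is a hypothetical name, interior points only are reported, and flat plateaus are not treated specially. The package's own function may handle end points and ties differently.

```r
# Hypothetical sketch: interior peaks and valleys from sign changes of diff(y).
turn_points <- function(y) {
  s <- sign(diff(y))        # +1 rising, -1 falling, 0 flat
  change <- diff(s)
  list(pks  = which(change < 0) + 1,   # rising then falling
       vlys = which(change > 0) + 1)   # falling then rising
}

y <- c(1, 3, 5, 4, 2, 3, 6, 5)
tp <- turn_points(y)
```

For this toy curve the peaks are at positions 3 and 7 and the single valley is at position 5.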
Author(s)
Tianwei Yu <tyu8@emory.edu>
References
Bioinformatics. 25(15):1930-36. BMC Bioinformatics. 11:559.
Interpolate missing intensities and calculate the area for a single EIC.
Description
This is an internal function that's not supposed to be called directly by the user.
Usage
interpol.area(x, y, all.x, all.w)
Arguments
x |
The positions x (retention times) at which a non-NA y is observed. |
y |
The observed intensities. |
all.x |
All possible x (retention times) in the LC/MS profile. |
all.w |
The "footprint" of each measured retention time, used as the weight for the corresponding y. |
Details
This is an internal function. It interpolates missing y using linear interpolation, and then calculates the area under the curve.
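The interpolate-then-integrate idea can be sketched with approx() from the stats package. This is an illustration under assumptions rather than the package's code: interp_area() is a hypothetical name, and only the time points inside the observed range of x are assumed to contribute to the area.

```r
# Hypothetical sketch: fill in missing intensities by linear interpolation,
# then sum intensity times the "footprint" weight of each time point.
interp_area <- function(x, y, all.x, all.w) {
  inside <- all.x >= min(x) & all.x <= max(x)      # assumption: no extrapolation
  y.hat <- approx(x, y, xout = all.x[inside])$y    # linear interpolation
  sum(y.hat * all.w[inside])
}

all.x <- 1:5
all.w <- c(0.5, 1, 1, 1, 0.5)
x <- c(1, 2, 4, 5)          # the intensity at time 3 is missing
y <- c(0, 10, 10, 0)
area <- interp_area(x, y, all.x, all.w)
```

The missing intensity at time 3 is interpolated as 10, so the weighted sum is 0 + 10 + 10 + 10 + 0 = 30.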
Value
The area is returned.
Author(s)
Tianwei Yu <tyu8@emory.edu>
Peak detection using the machine learning approach.
Description
The procedure uses information of known metabolites, and constructs prediction models to differentiate EICs.
Usage
learn.cdf(filename, output_path, tol = 2e-05, min.run = 4, min.pres = 0.3,
baseline.correct = 0, ridge.smoother.window = 50, smoother.window = c(1, 5, 10),
known.mz, match.tol.ppm = 5, do.plot = FALSE, pos.confidence = 0.99,
neg.confidence = 0.99, max.ftrs.to.use = 10, do.grp.reduce = TRUE,
remove.bottom.ftrs = 0, max.fpr = seq(0, 0.6, by = 0.1), min.tpr = seq(0.8, 1, by = 0.1),
intensity.weighted=FALSE)
Arguments
filename |
The cdf file name. If the file is not in the working directory, the path needs to be given. |
output_path |
Path to the output directory |
min.pres |
Run filter parameter. The minimum proportion of presence in the time period for a series of signals grouped by m/z to be considered a peak. |
min.run |
Run filter parameter. The minimum length of elution time for a series of signals grouped by m/z to be considered a peak. |
tol |
m/z tolerance level for the grouping of data points. This value is expressed as the fraction of the m/z value. This value, multiplied by the m/z value, becomes the cutoff level. The recommended value is the machine's nominal accuracy level. Divide the ppm value by 1e6. For FTMS, 1e-5 is recommended. |
baseline.correct |
After grouping the observations, the highest intensity in each group is found. If the highest is lower than this value, the entire group will be deleted. If the value is NA, the program uses the 75th percentile of the height of the noise groups. The default in the usage above is 0. |
ridge.smoother.window |
The size of the smoother window used by the kernel smoother to remove long ridge noise from each EIC. |
smoother.window |
The smoother windows to use in data feature generation. |
known.mz |
The m/z values of the known metabolites. |
match.tol.ppm |
The ppm tolerance to match identified features to known metabolites/features. |
do.plot |
Whether to produce diagnostic plots. |
pos.confidence |
The confidence level for the features matched to the known feature list. |
neg.confidence |
The confidence level for the features not matching to the known feature list. |
max.ftrs.to.use |
The maximum number of data features to use in a predictive model. |
do.grp.reduce |
Whether to reduce data features that are similar. It is based on data feature predictability. |
remove.bottom.ftrs |
The number of worst performing data features to remove before model building. |
max.fpr |
The proportion of unmatched features to be selected in the feature detection step. |
min.tpr |
The proportion of matched features to be selected in the feature detection step. |
intensity.weighted |
Whether to weight the local density by signal intensities. |
Details
The subroutine takes an LC/MS profile in CDF, mzXML, or a similar format. First the profile is sliced into EICs using adaptive binning. Then data features are extracted from each EIC. The EICs are classified into two groups: those whose m/z values match known m/z values, and those that don't. Classification models are built to separate the two classes, and each EIC is given a score by the classification model. Those with better scores are selected to enter the feature quantification step.
Value
A matrix with four columns: m/z value, retention time, intensity, and group number.
Author(s)
Tianwei Yu <tyu8@emory.edu>
Loading LC/MS data.
Description
This is an internal function. It loads LC/MS data into memory.
Usage
load.lcms(filename)
Arguments
filename |
The CDF file name. |
Details
The function uses functionality provided by the mzR package from Bioconductor.
Value
A list is returned.
masses |
The vector of m/z values. |
labels |
The vector of retention times. |
intensi |
The vector of intensity values. |
times |
The vector of unique time points. |
Author(s)
Tianwei Yu <tyu8@emory.edu>
References
Bioinformatics. 25(15):1930-36. BMC Bioinformatics. 11:559.
Producing a table of known features based on a table of metabolites and a table of allowable adducts.
Description
Given a table of known metabolites with original mass and charge information, and a table of allowable adducts, this function outputs a new table of potential features.
Usage
make.known.table(metabolite.table, adduct.table, ion.mode = "+")
Arguments
metabolite.table |
A table of known metabolites. See the description of the object "metabolite.table" for details. |
adduct.table |
A table of allowable adducts. See the description of the object "adduct.table" for details. |
ion.mode |
Character. Either "+" or "-". |
Details
For each allowable ion form, the function produces the m/z of every metabolite given to it. The output table follows the format that is required by the function semi.sup(), so that the user can directly use the table for semi-supervised feature detection.
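The calculation of ion m/z values from neutral masses can be sketched as a cross join between metabolites and adducts. Everything structural here is an assumption for illustration: the column names `divider` and `addition`, the toy tables, and the adduct mass shifts are not taken from the package's adduct.table, whose format is described in its own help page.

```r
# Hypothetical toy tables (not the package's data objects).
mets <- data.frame(name = "glucose", mass = 180.06339)
adducts <- data.frame(adduct   = c("M+H", "M+Na", "M+2H"),
                      divider  = c(1, 1, 2),                      # number of charges
                      addition = c(1.007276, 22.989218, 1.007276)) # mass shift per charge

# Cartesian product: every metabolite paired with every adduct.
ions <- merge(mets, adducts, by = NULL)

# Assumed rule: m/z = neutral mass / divider + addition.
ions$m.z <- ions$mass / ions$divider + ions$addition
ions[, c("name", "adduct", "m.z")]
```

Restricting the adduct table to the ion forms expected for a given LC/MS system keeps the resulting table small, which is the choice the ion.mode and adduct.table arguments leave to the user.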
Value
A data frame containing the known metabolite ions. It contains 18 columns:
"chemical_formula": the chemical formula, if known;
"HMDB_ID": the HMDB ID, if known;
"KEGG_compound_ID": the KEGG compound ID, if known;
"neutral.mass": the neutral mass, if known;
"ion.type": the ion form, such as H+, Na+, ..., if known;
"m.z": the m/z value, either theoretical for known metabolites, or the mean observed value for unknown but previously found features;
"Number_profiles_processed": the total number of LC/MS profiles that were used to build this database;
"Percent_found": the percentage of profiles in which this feature was found historically, among all data processed in building this database;
"mz_min": the minimum m/z value observed for this feature;
"mz_max": the maximum m/z value observed for this feature;
"RT_mean": the mean retention time observed for this feature;
"RT_sd": the standard deviation of retention time observed for this feature;
"RT_min": the minimum retention time observed for this feature;
"RT_max": the maximum retention time observed for this feature;
"int_mean.log.": the mean log intensity observed for this feature;
"int_sd.log.": the standard deviation of log intensity observed for this feature;
"int_min.log.": the minimum log intensity observed for this feature;
"int_max.log.": the maximum log intensity observed for this feature.
Author(s)
Tianwei Yu <tyu8@emory.edu>
References
Yu T, Park Y, Li S, Jones DP (2013) Hybrid feature detection and information accumulation using high-resolution LC-MS metabolomics data. J. Proteome Res. 12(3):1419-27.
See Also
metabolite.table, adduct.table, semi.sup
Examples
data(metabolite.table)
data(adduct.table)
known.table.example<-make.known.table(metabolite.table[1001:1020,], adduct.table[1:4,])
An internal function: finding matches between two vectors of m/z values.
Description
Given two vectors of m/z values and the tolerance ppm level, find the potential matches between the two vectors.
Usage
mass.match(x, known.mz, match.tol.ppm = 5)
Arguments
x |
m/z values from the data. |
known.mz |
m/z values from the known feature table. |
match.tol.ppm |
tolerance level in ppm. |
Value
A vector the same length as x. 1 indicates matched, and 0 indicates unmatched.
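A ppm-based match of this kind can be sketched in a few lines. The function name mass_match_sketch() is hypothetical, and the choice to express the ppm difference relative to the known m/z (rather than the observed one) is an assumption made for the illustration.

```r
# Hypothetical sketch: flag each observed m/z that lies within
# match.tol.ppm of at least one known m/z.
mass_match_sketch <- function(x, known.mz, match.tol.ppm = 5) {
  sapply(x, function(m) {
    ppm <- abs(m - known.mz) / known.mz * 1e6   # assumption: relative to known m/z
    as.integer(any(ppm <= match.tol.ppm))
  })
}

obs   <- c(180.0634, 200.0000, 255.2330)
known <- c(180.0630, 255.2320, 300.1000)
hits  <- mass_match_sketch(obs, known, match.tol.ppm = 5)
```

Here the first and third observed values lie roughly 2 and 4 ppm from a known value and are flagged with 1, while 200.0000 has no known value nearby and is flagged with 0.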
Author(s)
Tianwei Yu <tyu8@emory.edu>
An internal function.
Description
This is an internal function. It shouldn't be called by the end user.
Usage
merge_seq_3(a, mz, inte)
Arguments
a |
Vector of retention times. |
mz |
Vector of m/z values. |
inte |
Vector of signal strengths. |
Author(s)
Tianwei Yu <tyu8@emory.edu>
References
Bioinformatics. 25(15):1930-36. BMC Bioinformatics. 11:559.
A known metabolite table based on HMDB.
Description
This table was compiled from HMDB metabolites. It contains only the basic information of known metabolites. It can be used to produce feature tables with ion forms of the users' choice.
Usage
data(metabolite.table)
Format
A data frame containing the known metabolites. It contains 4 columns: "chemical_formula": the chemical formula of the metabolite; "HMDB_ID": the HMDB ID; "KEGG_compound_ID": the KEGG compound ID, if known; "mass": the neutral mass.
Details
It is to be used in combination with the object "adduct.table" to create a feature table with ion forms of the user's choice. Which ion forms to choose should be based on the LC/MS system.
Source
Wishart, D. S., et al. (2009). HMDB: a knowledgebase for the human metabolome. Nucleic Acids Res 37, D603-10.
Examples
data(metabolite.table)
data(adduct.table)
known.table.example<-make.known.table(metabolite.table[1001:1020,], adduct.table[1:4,])
Internal function: Updates the information of a feature for the known feature table.
Description
The function takes the information about the feature in the known feature table (if available), and updates it using the information found in the current dataset.
Usage
peak.characterize(existing.row = NA, ftrs.row, chr.row)
Arguments
existing.row |
The existing row in the known feature table (detailed in the semi.sup() document). |
ftrs.row |
The row of the matched feature in the new aligned feature table. |
chr.row |
The row of the matched feature in the new retention time table of aligned features. |
Details
The function calculates and updates the mean intensity, variation of intensity, mean retention time etc.
Value
A vector, the updated row for the known feature table.
Author(s)
Tianwei Yu <tyu8@emory.edu>
Plot the data in the m/z and retention time plane.
Description
This is a diagnostic function. It takes the original CDF file, as well as the detected feature table, and plots the data in the m/z - retention time plane, using a user-defined range. The entire dataset is too large to plot at once, so the main purpose is to focus on small subregions of the data and check the peak detection results.
Usage
plot_cdf_2d(rawname, f, mzlim, timelim, lwd = 1)
Arguments
rawname |
The CDF file name. |
f |
The output object of prof.to.features(). |
mzlim |
The m/z range to plot. |
timelim |
The retention time range to plot. |
lwd |
Line width parameter, to be passed on to the function lines(). |
Value
There is no return value.
Author(s)
Tianwei Yu <tyu8@emory.edu>
References
Bioinformatics. 25(15):1930-36. BMC Bioinformatics. 11:559.
Plot the data in the m/z and retention time plane.
Description
This is a diagnostic function. It takes the original text file, as well as the detected feature table, and plots the data in the m/z - retention time plane, using a user-defined range. The entire dataset is too large to plot at once, so the main purpose is to focus on small subregions of the data and check the peak detection results.
Usage
plot_txt_2d(rawname, f, mzlim, timelim, lwd = 1)
Arguments
rawname |
The text file name. |
f |
The output object of prof.to.features(). |
mzlim |
The m/z range to plot. |
timelim |
The retention time range to plot. |
lwd |
Line width parameter, to be passed on to the function lines(). |
Details
The columns in the text file need to be separated by tab. The first column needs to be the retention time, the second column the m/z values, and the third column the intensity values. The first row needs to be the column labels, rather than values.
Value
There is no return value.
Author(s)
Tianwei Yu <tyu8@emory.edu>
References
Bioinformatics. 25(15):1930-36. BMC Bioinformatics. 11:559.
Generates 3 dimensional plots for LCMS data.
Description
This function takes the matrix output from proc.cdf() and generates a 3D plot of the data. It relies on the rgl library.
Usage
present.cdf.3d(prof, fill.holes = TRUE, transform = "none", time.lim = NA,
mz.lim = NA, box = TRUE, axes = TRUE)
Arguments
prof |
The matrix output from the proc.cdf() function. |
fill.holes |
A lot of peaks have missing values at some time points. If fill.holes is TRUE, the function will fill in the missing values by interpolation. |
transform |
If the value is "sqrt", the values are transformed by taking the square root. If "cuberoot", the values are transformed by taking the cube root. |
time.lim |
The limit in retention time for the area of spectrum to be plotted. It should be either NA or a vector of two values: the lower limit and the upper limit. |
mz.lim |
The limit in m/z value for the area of spectrum to be plotted. It should be either NA or a vector of two values: the lower limit and the upper limit. |
box |
If a box should be drawn. |
axes |
If the axes should be drawn. |
Details
The function calls the rgl library. Spectrum values within the time.lim and mz.lim ranges are plotted in 3D.
Value
There is no return value from this function.
Author(s)
Tianwei Yu <tyu8@emory.edu>
References
http://rgl.neoscientists.org/about.shtml
Examples
data(prof)
present.cdf.3d(prof[[2]],time.lim=c(250,400), mz.lim=c(400,500))
Filter noise and detect peaks from LC/MS data in CDF format
Description
This function applies the run filter to remove noise. Data points are grouped into EICs in this step.
Usage
proc.cdf(filename, output_path, min.pres=0.5, min.run=12, tol=1e-5, baseline.correct=0,
baseline.correct.noise.percentile=0, do.plot=TRUE, intensity.weighted=FALSE)
Arguments
filename |
The cdf file name. If the file is not in the working directory, the path needs to be given. |
output_path |
Path to the output directory |
min.pres |
Run filter parameter. The minimum proportion of presence in the time period for a series of signals grouped by m/z to be considered a peak. |
min.run |
Run filter parameter. The minimum length of elution time for a series of signals grouped by m/z to be considered a peak. |
tol |
m/z tolerance level for the grouping of data points. This value is expressed as the fraction of the m/z value. This value, multiplied by the m/z value, becomes the cutoff level. The recommended value is the machine's nominal accuracy level. Divide the ppm value by 1e6. For FTMS, 1e-5 is recommended. |
baseline.correct |
After grouping the observations, the highest intensity in each group is found. If the highest is lower than this value, the entire group will be deleted. If the value is NA, the program uses a percentile of the height of the noise groups (see baseline.correct.noise.percentile). If given a value, the value will be used as the threshold, and baseline.correct.noise.percentile will be ignored. The default in the usage above is 0. |
baseline.correct.noise.percentile |
The percentile of signal strength of those EICs that don't pass the run filter, to be used as the baseline threshold of signal strength. |
do.plot |
Whether to generate diagnostic plots. |
intensity.weighted |
Whether to weight the local density by signal intensities. |
Details
The subroutine takes an LC/MS profile in CDF, mzXML, or a similar format. The m/z values are grouped based on the tolerance level using multi-stage smoothing and peak finding. Non-parametric density estimation is used in both the m/z dimension and the elution time dimension to fine-tune the signal grouping. A run filter is applied, which requires a "true peak" to have a minimum length in the retention time dimension (parameter: min.run), and to be detected in at least a given proportion of the time points within that time period (parameter: min.pres).
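The run filter criterion can be illustrated on a single m/z group. The sketch below is a simplification under stated assumptions, not the package's algorithm: passes_run_filter() is a hypothetical name, scan times are assumed to be sorted, and a group passes if any window of duration min.run contains signal in at least a proportion min.pres of its scans.

```r
# Hypothetical sketch of a run filter for one m/z group.
# obs.times: scan times at which the group has a signal.
# all.times: all scan times in the profile, sorted increasingly.
passes_run_filter <- function(obs.times, all.times, min.run = 12, min.pres = 0.5) {
  present <- all.times %in% obs.times            # signal seen at each scan?
  for (start in all.times) {
    if (start + min.run > max(all.times)) break  # window would run off the end
    in.win <- all.times >= start & all.times <= start + min.run
    if (mean(present[in.win]) >= min.pres) return(TRUE)
  }
  FALSE
}

scans  <- 1:60
dense  <- passes_run_filter(c(10, 11, 13, 14, 16, 17, 19, 20, 22), scans)
sparse <- passes_run_filter(c(5, 20, 40), scans)
```

The first group has signal in more than half of the scans of a 12-unit window and passes, even though some scans are missing. The second group consists of isolated signals and fails, which is the behaviour that lets the filter keep weak but persistent peaks while discarding sporadic noise.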
Value
A matrix with four columns: m/z value, retention time, intensity, and group number.
Author(s)
Tianwei Yu <tyu8@emory.edu>
Compute a 2D Binned Kernel Density Estimate from LC/MS data in CDF format.
Description
This function computes a density estimate of an LC/MS data matrix based on the density at each point. It returns information on the centre of each peak, including its coordinates along each axis and the distances between the peak centre and the grid boundaries.
Usage
proc.cdf.2d(filename, mz.cut = 5e-4, rt.cut = 50, mz.search.range = 2e-3,
rt.search.range = 200, mz.search.step = 5e-4, rt.search.step = 50,
intensity.limit.quantile = 0.1, bPlot = FALSE, transform.mz=FALSE, transform.mz.const=0.1)
Arguments
filename |
The cdf file name. If the file is not in the working directory, the path needs to be given. |
mz.cut |
The grid width in m/z used when calculating the density of each point. |
rt.cut |
The grid width in RT used when calculating the density of each point. |
mz.search.range |
Maximum peak width in m/z. |
rt.search.range |
Maximum peak width in RT. |
mz.search.step |
Maximum search step in m/z. |
rt.search.step |
Maximum search step in RT. |
intensity.limit.quantile |
Intensity threshold. |
bPlot |
Whether to plot. |
transform.mz |
Whether to apply a nonlinear transformation to m/z values before alignment. |
transform.mz.const |
A constant used in the m/z transformation function. |
Value
finalMatrix |
A matrix containing the peak information. Each row describes one peak, and each column one aspect of the peak:
Column 1: the X position of the peak's centre.
Column 2: the Y position of the peak's centre.
Column 3: the distance between the centre of the peak and the top boundary of the divided grid.
Column 4: the distance between the centre of the peak and the bottom boundary of the divided grid.
Column 5: the peak's value.
Column 6: the distance between the centre of the peak and the left boundary of the divided grid.
Column 7: the distance between the centre of the peak and the right boundary of the divided grid. |
Examples
library(msdata)
filepath <- system.file("microtofq", package = "msdata")
file <- list.files(filepath, pattern="MM14.mzML",
full.names=TRUE, recursive = TRUE)
peakInfo <- proc.cdf.2d(file)
Filter noise and detect peaks from LC/MS data in text format
Description
This function applies the run filter to remove noise. Data points are grouped into EICs in this step.
Usage
proc.txt(filename, output_path, min.pres=0.5, min.run=12,tol=NA, find.tol.maxd=1e-4,
baseline.correct.noise.percentile=0.25, baseline.correct=0)
Arguments
filename |
The text file name. If the file is not in the working directory, the path needs to be given. |
output_path |
Path to the output directory |
min.pres |
Run filter parameter. The minimum proportion of presence in the time period for a series of signals grouped by m/z to be considered a peak. |
min.run |
Run filter parameter. The minimum length of elution time for a series of signals grouped by m/z to be considered a peak. |
tol |
m/z tolerance level for the grouping of data points. This value is expressed as the fraction of the m/z value. This value, multiplied by the m/z value, becomes the cutoff level. The recommended value is the machine's nominal accuracy level. Divide the ppm value by 1e6. For FTMS, 1e-5 is recommended. |
find.tol.maxd |
maximum distance between datapoints that are allowed in the procedure to find tolerance. |
baseline.correct |
After grouping the observations, the highest intensity in each group is found. If the highest is lower than this value, the entire group will be deleted. If the value is NA, the program uses the 75th percentile of the height of the noise groups. The default in the usage above is 0. |
baseline.correct.noise.percentile |
The percentile of signal strength of those EICs that don't pass the run filter, to be used as the baseline threshold of signal strength. |
Details
The columns in the text file need to be separated by tab. The first column needs to be the retention time, the second column the m/z values, and the third column the intensity values. The first row needs to be the column labels, rather than values. The m/z are grouped based on the tolerance level using multi-stage smoothing and peak finding. Non-parametric density estimation is used in both m/z dimension and elution time dimension to fine-tune the signal grouping. A run filter is applied, which requires a "true peak" to have a minimum length in the retention time dimension (parameter: min.run), as well as being detected at or higher than a proportion of the time points within the time period (parameter: min.pres).
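The expected text layout (tab-separated, one header row, then retention time, m/z and intensity) can be checked with base R before processing. The column names and values below are made up for illustration, and the sketch only writes and re-reads a temporary file; it does not call proc.txt().

```r
# Hypothetical toy data in the required column order:
# retention time, m/z, intensity.
toy <- data.frame(time      = c(10.1, 10.1, 10.6),
                  mz        = c(180.0634, 255.2330, 180.0636),
                  intensity = c(1200, 850, 1900))

tmp <- tempfile(fileext = ".txt")
# Tab-separated, header row with column labels, no row names or quotes.
write.table(toy, tmp, sep = "\t", row.names = FALSE, quote = FALSE)

# Read it back the same way to confirm the layout.
back <- read.table(tmp, sep = "\t", header = TRUE)
str(back)
```

Reading the file back with header = TRUE confirms that the first row is treated as labels rather than data, which is the requirement stated above.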
Value
A matrix with four columns: m/z value, retention time, intensity, and group number.
Author(s)
Tianwei Yu <tyu8@emory.edu>
Sample profile data after noise filtration by the run filter
Description
A list object containing 4 matrices. Each matrix is from an LC/MS profile.
Usage
data(prof)
Format
Each matrix contains 4 columns: m/z, retention time, intensity, and group number.
Source
Data from Dean Jones lab, Emory University School of Medicine.
Examples
data(prof)
present.cdf.3d(prof[[2]],time.lim=c(250,400), mz.lim=c(400,500))
this.feature<-prof.to.features(prof[[1]])
Generate feature table from noise-removed LC/MS profile
Description
Each LC/MS profile is first processed by the function proc.cdf() to remove noise and reduce data size. A matrix containing m/z value, retention time, intensity, and group number is output from proc.cdf(). This matrix is then fed to the function prof.to.features() to generate a feature list. Every detected feature is summarized into a single row in the output matrix from this function.
Usage
prof.to.features(a, bandwidth=0.5, min.bw=NA, max.bw=NA, sd.cut=c(0.1, 100),
sigma.ratio.lim=c(0.1, 10), shape.model="bi-Gaussian", estim.method="moment",
do.plot=TRUE, power=1, component.eliminate=0.01, BIC.factor=2)
Arguments
a |
The matrix output from proc.cdf(). It contains columns of m/z value, retention time, intensity and group number. |
bandwidth |
A value between zero and one. Multiplying this value by the length of the signal along the time axis helps determine the bandwidth in the kernel smoother used for peak identification. See the details section. |
min.bw |
The minimum bandwidth to use in the kernel smoother. See the details section. |
max.bw |
The maximum bandwidth to use in the kernel smoother. See the details section. |
sd.cut |
A vector of two. Features with standard deviation outside the range defined by the two numbers are eliminated. |
sigma.ratio.lim |
A vector of two. It constrains the ratio between the left standard deviation and the right standard deviation of the bi-Gaussian function used to fit the data to the range defined by the two numbers. |
shape.model |
The mathematical model for the shape of a peak. There are two choices - "bi-Gaussian" and "Gaussian". When the peaks are asymmetric, the bi-Gaussian is better. The default is "bi-Gaussian". |
estim.method |
The estimation method for the bi-Gaussian peak model. Two possible values: moment and EM. |
do.plot |
Whether to generate diagnostic plots. |
component.eliminate |
In fitting mixture of bi-Gaussian (or Gaussian) model of an EIC, when a component accounts for a proportion of intensities less than this value, the component will be ignored. |
power |
The power parameter for data transformation when fitting the bi-Gaussian or Gaussian mixture model in an EIC. |
BIC.factor |
The factor by which the number of parameters is multiplied to modify the BIC criterion. If larger than 1, models with more peaks are penalized more. |
Details
This function generates the feature table from the noise-removed profile. The m/z values are already grouped by the function proc.cdf() to generate EICs. The task of this function is to look at every EIC and determine (1) how many peaks there are, and (2) the location, spread and area of each peak.
For the first task, when a series of signals is found at an m/z group, a kernel smoother is fit along the time axis to determine whether there is a single peak or multiple peaks. The bandwidth of the kernel smoother is determined as follows: multiply the length of the signals by the bandwidth parameter. If the resulting value is between the parameters min.bw and max.bw, use that value; if it is below min.bw, use min.bw; if it is above max.bw, use max.bw. The default values of min.bw and max.bw are NA, in which case min.bw is set to 1/30 of the retention time range, and max.bw is set to 1/15 of the retention time range.
For the second task, a modified EM algorithm, which allows values to be missing completely at random at certain time points, is used to evaluate the peak location and area. If a single peak is detected by the kernel smoother, the maximum likelihood normal curve is fitted. If multiple peaks are detected, the EM algorithm finds the normal mixture that best explains the data.
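The bandwidth rule described in the details can be written as a small clamping function. This is a sketch of the stated rule only: choose_bw() is a hypothetical name, the "length of the signals" is taken to be the time span of the EIC, and rt.range is assumed to be the retention time range of the whole profile.

```r
# Hypothetical sketch of the bandwidth selection rule.
choose_bw <- function(times, rt.range, bandwidth = 0.5, min.bw = NA, max.bw = NA) {
  if (is.na(min.bw)) min.bw <- diff(rt.range) / 30   # default lower bound
  if (is.na(max.bw)) max.bw <- diff(rt.range) / 15   # default upper bound
  bw <- bandwidth * diff(range(times))               # proportional to signal length
  min(max(bw, min.bw), max.bw)                       # clamp to [min.bw, max.bw]
}

rt.range <- c(0, 600)                                # defaults become 20 and 40
bw.short  <- choose_bw(c(100, 104, 110), rt.range)   # 5 is raised to min.bw
bw.medium <- choose_bw(c(100, 160), rt.range)        # 30 is used as is
bw.long   <- choose_bw(c(100, 300), rt.range)        # 100 is lowered to max.bw
```

The lower bound keeps very short EICs from being smoothed with a bandwidth so narrow that noise creates spurious peaks, while the upper bound keeps long EICs from being oversmoothed into a single peak.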
Value
A matrix is returned. The columns are: m/z value, retention time, spread (standard deviation of the estimated normal curve), and estimated total signal strength (total area of the estimated normal curve).
Author(s)
Tianwei Yu <tyu8@sph.emory.edu>
See Also
proc.cdf
Examples
data(prof)
this.feature<-prof.to.features(prof[[1]])
this.feature[1:5,]
Recover weak signals in some profiles that are not identified as peaks, but correspond to identified peaks in other spectra.
Description
Given the aligned feature table, some features are identified in a subgroup of spectra. This doesn't mean they don't exist in the other spectra. The signal could be too low to pass the run filter. Thus after obtaining the aligned feature table, this function re-analyzes each spectrum to try and fill in the holes in the aligned feature table.
Usage
recover.weaker(filename, loc, aligned.ftrs, pk.times, align.mz.tol,
align.chr.tol, this.f1, this.f2, mz.range = NA,
chr.range = NA, use.observed.range = TRUE, orig.tol =
1e-05, min.bw = NA, max.bw = NA, bandwidth = 0.5,
recover.min.count = 3, intensity.weighted=FALSE)
Arguments
filename |
The CDF file name from which the weaker signal is to be recovered. |
loc |
The location of the filename in the vector of filenames. |
aligned.ftrs |
A matrix, with columns of m/z values, elution times, and signal strengths in each spectrum. |
pk.times |
A matrix, with columns of m/z, median elution time, and elution times in each spectrum. |
align.mz.tol |
The m/z tolerance used in the alignment. |
align.chr.tol |
The elution time tolerance in the alignment. |
this.f1 |
The matrix which is the output from prof.to.features(). |
this.f2 |
The matrix which is the output from prof.to.features(). The retention times in this object have been adjusted by the function adjust.time(). |
orig.tol |
The m/z tolerance (the tol parameter) provided to the proc.cdf() function. This helps retrieve the intermediate file. |
mz.range |
The m/z around the feature m/z to search for observations. The default value is NA, in which case 1.5 times the m/z tolerance in the aligned object will be used. |
chr.range |
The retention time around the feature retention time to search for observations. The default value is NA, in which case 0.5 times the retention time tolerance in the aligned object will be used. |
use.observed.range |
If the value is TRUE, the actual range of the observed locations of the feature in all the spectra will be used. |
min.bw |
The minimum bandwidth to use in the kernel smoother. See the details section. |
max.bw |
The maximum bandwidth to use in the kernel smoother. See the details section. |
bandwidth |
A value between zero and one. Multiplying this value by the length of the signal along the time axis helps determine the bandwidth in the kernel smoother used for peak identification. See the details section. |
recover.min.count |
The minimum number of raw data points required to support a recovery. |
intensity.weighted |
Whether to use intensity to weight mass density estimation. |
Details
For every feature, if it is not present in a spectrum, open the spectrum, and look around the m/z and elution time location of the feature. The observed intensities with m/z and elution time most consistent with the feature are collected. The peak location and intensity are evaluated.
For each spectrum, the partially processed file: .rawprof is loaded. This file is the product of the function proc.cdf(). The m/z values are already grouped and the median taken. The function searches around the feature m/z and retention time. When a series of signals is found at an m/z group, a kernel smoother is fit along the time axis to determine whether there is a single peak or multiple peaks. The bandwidth of the kernel smoother is determined as follows: multiply the length of the signals by the bandwidth parameter. If the resulting value is between min.bw and max.bw, use that value; if it is below min.bw, use min.bw; if it is above max.bw, use max.bw. The default values of min.bw and max.bw are NA, in which case min.bw is set to 1/30 of the retention time range, and max.bw is set to 1/15 of the retention time range.
A modified EM algorithm, which allows values to be missing completely at random at certain time points, is used to evaluate the peak location and area. If a single peak is detected by the kernel smoother, the maximum likelihood normal curve is fitted. If multiple peaks are detected, the EM algorithm finds the normal mixture that best explains the data. After finding the peaks around the target feature, the function finds the one closest to the target feature and records its information in the $aligned.ftrs and $pk.times matrices.
Value
Returns a list object with the following objects in it:
aligned.ftrs |
A matrix, with columns of m/z values, elution times, and signal strengths in each spectrum. |
pk.times |
A matrix, with columns of m/z, median elution time, and elution times in each spectrum. |
mz.tol |
The m/z tolerance in the aligned object. |
chr.tol |
The elution time tolerance in the aligned object. |
Author(s)
Tianwei Yu <tyu8@sph.emory.edu>
Removing long ridges at the same m/z.
Description
This is an internal function. It subtracts a background estimated through kernel smoothing when an EIC continuously spans more than half the retention time range.
Usage
rm.ridge(x, y2, bw)
Arguments
x |
Retention time vector. |
y2 |
Intensity vector. |
bw |
Bandwidth for the kernel smoother. A very wide one is used here. |
Value
A vector of intensity values is returned.
Author(s)
Tianwei Yu <tyu8@emory.edu>
References
Bioinformatics. 25(15):1930-36. BMC Bioinformatics. 11:559.
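The idea behind rm.ridge(), subtracting a background estimated with a very wide kernel smoother, can be illustrated with base R. The sketch below is not the package's implementation: the function name remove.ridge.sketch is hypothetical, stats::ksmooth() is used as a stand-in smoother, the retention time vector is assumed to be sorted in increasing order, and truncating negative values at zero is an assumption made for the illustration.

```r
# Hypothetical sketch, not the package's rm.ridge() code.
remove.ridge.sketch <- function(x, y2, bw) {
  # Estimate a slowly varying background with a wide Gaussian kernel,
  # evaluated at the observed retention times (x must be increasing).
  bg <- ksmooth(x, y2, kernel = "normal", bandwidth = bw, x.points = x)$y
  # Subtract the background; negative values are set to zero (assumption).
  pmax(y2 - bg, 0)
}

# Simulated EIC: a constant ridge of height 100 plus one narrow peak at 300.
x <- seq(0, 600, by = 2)
y <- 100 + 500 * exp(-(x - 300)^2 / (2 * 5^2))
y.clean <- remove.ridge.sketch(x, y, bw = 200)
```

Because the smoother is much wider than the peak, the flat ridge is removed almost entirely while the narrow peak keeps its position and most of its height.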
Semi-supervised feature detection
Description
The semi-supervised procedure utilizes a database of known metabolites and previously detected features to identify features in a new dataset. It is recommended ONLY for experienced users. The user may need to construct the known feature database that strictly follows the format described below.
Usage
semi.sup(folder, output_path, file.pattern = ".cdf", known.table = NA, n.nodes = 4,
min.exp = 2, min.pres = 0.5, min.run = 12, mz.tol = 1e-5,
baseline.correct.noise.percentile = 0.05, shape.model = "bi-Gaussian", BIC.factor = 2,
baseline.correct = 0, peak.estim.method = "moment", min.bw = NA, max.bw = NA,
sd.cut = c(0.01, 500), sigma.ratio.lim = c(0.01, 100), component.eliminate = 0.01,
moment.power = 1, subs = NULL, align.mz.tol = NA, align.chr.tol = NA,
max.align.mz.diff = 0.01, pre.process = FALSE, recover.mz.range = NA,
recover.chr.range = NA, use.observed.range = TRUE, match.tol.ppm = NA,
new.feature.min.count = 2, recover.min.count = 3, intensity.weighted = FALSE)
Arguments
folder |
The folder where all CDF files to be processed are located. For example "C:/CDF/this_experiment" |
output_path |
Path to the output directory |
file.pattern |
The pattern in the names of the files to be processed. The default is ".cdf". Other formats supported by mzR package can also be used, e.g. "mzML" etc. |
known.table |
A data frame containing the known metabolite ions and previously found features. It contains 18 columns: "chemical_formula": the chemical formula if known; "HMDB_ID": HMDB ID if known; "KEGG_compound_ID": KEGG compound ID if known; "neutral.mass": the neutral mass if known; "ion.type": the ion form, such as H+, Na+, ..., if known; "m.z": m/z value, either theoretical for known metabolites, or mean observed value for unknown but previously found features; "Number_profiles_processed": the total number of LC/MS profiles that were used to build this database; "Percent_found": the percentage of all profiles processed in building this database in which this feature was found; "mz_min": the minimum m/z value observed for this feature; "mz_max": the maximum m/z value observed for this feature; "RT_mean": the mean retention time observed for this feature; "RT_sd": the standard deviation of retention time observed for this feature; "RT_min": the minimum retention time observed for this feature; "RT_max": the maximum retention time observed for this feature; "int_mean.log.": the mean log intensity observed for this feature; "int_sd.log.": the standard deviation of log intensity observed for this feature; "int_min.log.": the minimum log intensity observed for this feature; "int_max.log.": the maximum log intensity observed for this feature. |
n.nodes |
The number of CPU cores to be used through doSNOW. |
min.exp |
If a feature is to be included in the final feature table, it must be present in at least this number of spectra. |
min.pres |
This is a parameter of the run filter, to be passed to the function proc.cdf(). Please see the help for proc.cdf() for details. |
min.run |
This is a parameter of the run filter, to be passed to the function proc.cdf(). Please see the help for proc.cdf() for details. |
subs |
If not all the CDF files in the folder are to be processed, the user can define a subset using this parameter. For example, subs=15:30, or subs=c(2,4,6,8) |
mz.tol |
The user can provide the m/z tolerance level for peak identification. This value is expressed as a fraction of the m/z value (for example, 1e-5 corresponds to 10 ppm). This value, multiplied by the m/z value, becomes the cutoff level. Please see the help for proc.cdf() for details. |
baseline.correct.noise.percentile |
The percentile of signal strength of those EICs that don't pass the run filter, to be used as the baseline threshold of signal strength. This parameter is passed to proc.cdf(). |
shape.model |
The mathematical model for the shape of a peak. There are two choices - bi-Gaussian and Gaussian. When the peaks are asymmetric, the bi-Gaussian is better. The default is bi-Gaussian. |
BIC.factor |
The factor by which the number of parameters is multiplied to modify the BIC criterion. If larger than 1, models with more peaks are penalized more. |
baseline.correct |
This is a parameter in peak detection. After grouping the observations, the highest observation in each group is found. If the highest is lower than this value, the entire group will be deleted. If the value is NA, the program searches for the cutoff level. Note that the default shown in the usage above is 0, not NA. Please see the help for proc.cdf() for details. |
peak.estim.method |
The bi-Gaussian peak parameter estimation method, to be passed to the subroutine prof.to.features(). Two possible values: moment and EM. |
min.bw |
The minimum bandwidth in the smoother in prof.to.features(). Please see the help file for prof.to.features() for details. |
max.bw |
The maximum bandwidth in the smoother in prof.to.features(). Please see the help file for prof.to.features() for details. |
sd.cut |
A parameter for the prof.to.features() function. A vector of two. Features with standard deviation outside the range defined by the two numbers are eliminated. |
sigma.ratio.lim |
A parameter for the prof.to.features() function. A vector of two. It enforces the belief of the range of the ratio between the left-standard deviation and the right-standard deviation of the bi-Gaussian function used to fit the data. |
component.eliminate |
When fitting a mixture of bi-Gaussian (or Gaussian) models to an EIC, a component that accounts for a proportion of intensities less than this value is ignored. |
moment.power |
The power parameter for data transformation when fitting the bi-Gaussian or Gaussian mixture model in an EIC. |
align.chr.tol |
The user can provide the elution time tolerance level to override the program's selection. This value is in the same unit as the elution time, normally seconds. Please see the help for match.time() for details. |
align.mz.tol |
The user can provide the m/z tolerance level for peak alignment to override the program's selection. This value is expressed as a fraction of the m/z value (for example, 1e-5 corresponds to 10 ppm). This value, multiplied by the m/z value, becomes the cutoff level. Please see the help for feature.align() for details. |
max.align.mz.diff |
As the m/z tolerance in alignment is expressed in relative terms (ppm), it may not be suitable when the m/z range is wide. This parameter limits the tolerance in absolute terms. It mostly influences feature matching in higher m/z range. |
pre.process |
Logical. If true, the program will not perform time correction and alignment. It will only generate peak tables for each spectrum and save the files. It allows manually dividing the task among multiple machines. |
recover.mz.range |
A parameter of the recover.weaker() function. The m/z around the feature m/z to search for observations. The default value is NA, in which case 1.5 times the m/z tolerance in the aligned object will be used. |
recover.chr.range |
A parameter of the recover.weaker() function. The retention time around the feature retention time to search for observations. The default value is NA, in which case 0.5 times the retention time tolerance in the aligned object will be used. |
use.observed.range |
A parameter of the recover.weaker() function. If the value is TRUE, the actual range of the observed locations of the feature in all the spectra will be used. |
match.tol.ppm |
The ppm tolerance to match identified features to known metabolites/features. |
new.feature.min.count |
The number of profiles in which a new feature must be present for it to be added to the database. |
recover.min.count |
The minimum time point count for a series of points in the EIC for it to be considered a true feature. |
intensity.weighted |
Whether to weight the local density by signal intensities. |
Details
The function first conducts an unsupervised feature detection in the new dataset. It then matches the newly identified features to the database. Next, the database features that were not found are merged with the newly found features, and a weak signal recovery is performed. The final feature table is used to update the database.
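Because the known.table argument must follow the 18-column format listed in the Arguments section, it can help to build and check the table programmatically. The following is a minimal sketch under stated assumptions: the helper names new.known.row and has.known.table.format are hypothetical and not part of the package, statistics that have not yet been observed are left as NA (an assumption), and the example row uses the monoisotopic mass of C6H12O6 and the m/z of its protonated ion.

```r
# The 18 column names, in the order given in the Arguments section.
known.table.columns <- c(
  "chemical_formula", "HMDB_ID", "KEGG_compound_ID", "neutral.mass",
  "ion.type", "m.z", "Number_profiles_processed", "Percent_found",
  "mz_min", "mz_max", "RT_mean", "RT_sd", "RT_min", "RT_max",
  "int_mean.log.", "int_sd.log.", "int_min.log.", "int_max.log."
)

# Hypothetical helper: one row for a known ion that has not been observed yet.
new.known.row <- function(formula, ion.type, neutral.mass, mz) {
  row <- as.data.frame(matrix(NA, nrow = 1,
                              ncol = length(known.table.columns)))
  names(row) <- known.table.columns
  row[["chemical_formula"]] <- formula
  row[["ion.type"]] <- ion.type
  row[["neutral.mass"]] <- neutral.mass
  row[["m.z"]] <- mz
  row
}

# Hypothetical check: a data frame with exactly the expected columns.
has.known.table.format <- function(tbl) {
  is.data.frame(tbl) && identical(names(tbl), known.table.columns)
}

known.table <- new.known.row("C6H12O6", "H+", 180.06339, 181.07066)
format.ok <- has.known.table.format(known.table)
```

Further rows can be appended with rbind(), since every row produced by the helper carries the same column names.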
Value
A list is returned.
features |
A list object, each component of which is the peak table from a single spectrum. |
features2 |
A list object, each component of which is the peak table from a single spectrum, after elution time correction. |
aligned.ftrs |
Feature table BEFORE weak signal recovery. |
final.ftrs |
Feature table after weak signal recovery. This is the end product of the function. |
pk.times |
Table of feature elution time BEFORE weak signal recovery. |
final.times |
Table of feature elution time after weak signal recovery. |
mz.tol |
The input mz.tol value by the user. |
align.mz.tol |
The m/z tolerance level in the alignment across spectra, either input from the user or automatically selected when the user input is NA. |
align.chr.tol |
The retention time tolerance level in the alignment across spectra, either input from the user or automatically selected when the user input is NA. |
updated.known.table |
The known table updated using the newly processed data. It should be used for future datasets generated using the same machine and LC column. |
ftrs.known.table.pairing |
The pairing information between the feature table of the current dataset and the known feature table. |
intensity.weighted |
Whether to weight the local density by signal intensities in the initial peak detection stage. |
Author(s)
Tianwei Yu <tianwei.yu@emory.edu>
See Also
cdf.to.ftrs, proc.cdf, prof.to.feature, adjust.time, feature.align, recover.weaker
Semi-supervised feature detection using 2D peak detection
Description
The semi-supervised procedure utilizes a database of known metabolites and previously detected features to identify features in a new dataset. It is recommended ONLY for experienced users. The user may need to construct the known feature database that strictly follows the format described below.
Usage
semi.sup.2d(folder, output_path, file.pattern=".cdf", known.table=NA, n.nodes=4,
min.exp=2, mz.cut = 5e-5, rt.cut = 50, mz.search.range = 2e-4, rt.search.range = 200,
intensity.limit.quantile = 0.05, mPower=4, mz.tol=1e-5, subs=NULL, align.mz.tol=NA,
align.chr.tol=NA, max.align.mz.diff=0.01, pre.process=FALSE, recover.mz.range=NA,
recover.chr.range=NA, use.observed.range=TRUE, match.tol.ppm=NA, new.feature.min.count=2,
recover.min.count=3, intensity.weighted=FALSE)
Arguments
folder |
The folder where all CDF files to be processed are located. For example "C:/CDF/this_experiment" |
output_path |
Path to the output directory |
file.pattern |
The pattern in the names of the files to be processed. The default is ".cdf". Other formats supported by mzR package can also be used, e.g. "mzML" etc. |
known.table |
A data frame containing the known metabolite ions and previously found features. It contains 18 columns: "chemical_formula": the chemical formula if known; "HMDB_ID": HMDB ID if known; "KEGG_compound_ID": KEGG compound ID if known; "neutral.mass": the neutral mass if known; "ion.type": the ion form, such as H+, Na+, ..., if known; "m.z": m/z value, either theoretical for known metabolites, or mean observed value for unknown but previously found features; "Number_profiles_processed": the total number of LC/MS profiles that were used to build this database; "Percent_found": the percentage of all profiles processed in building this database in which this feature was found; "mz_min": the minimum m/z value observed for this feature; "mz_max": the maximum m/z value observed for this feature; "RT_mean": the mean retention time observed for this feature; "RT_sd": the standard deviation of retention time observed for this feature; "RT_min": the minimum retention time observed for this feature; "RT_max": the maximum retention time observed for this feature; "int_mean.log.": the mean log intensity observed for this feature; "int_sd.log.": the standard deviation of log intensity observed for this feature; "int_min.log.": the minimum log intensity observed for this feature; "int_max.log.": the maximum log intensity observed for this feature. |
n.nodes |
The number of CPU cores to be used through doSNOW. |
min.exp |
If a feature is to be included in the final feature table, it must be present in at least this number of spectra. |
mz.cut |
The grid width in m/z used when calculating the density of each point. |
rt.cut |
The grid width in RT used when calculating the density of each point. |
mz.search.range |
Maximum peak width in m/z. |
rt.search.range |
Maximum peak width in RT. |
intensity.limit.quantile |
Intensity threshold, expressed as a quantile. |
mPower |
The power parameter for data transformation when fitting the bi-Gaussian or Gaussian mixture model in an EIC. |
subs |
If not all the CDF files in the folder are to be processed, the user can define a subset using this parameter. For example, subs=15:30, or subs=c(2,4,6,8) |
mz.tol |
The user can provide the m/z tolerance level for peak identification. This value is expressed as a fraction of the m/z value (for example, 1e-5 corresponds to 10 ppm). This value, multiplied by the m/z value, becomes the cutoff level. Please see the help for proc.cdf() for details. |
align.mz.tol |
The user can provide the m/z tolerance level for peak alignment to override the program's selection. This value is expressed as a fraction of the m/z value (for example, 1e-5 corresponds to 10 ppm). This value, multiplied by the m/z value, becomes the cutoff level. Please see the help for feature.align() for details. |
align.chr.tol |
The user can provide the elution time tolerance level to override the program's selection. This value is in the same unit as the elution time, normally seconds. Please see the help for match.time() for details. |
max.align.mz.diff |
As the m/z tolerance in alignment is expressed in relative terms (ppm), it may not be suitable when the m/z range is wide. This parameter limits the tolerance in absolute terms. It mostly influences feature matching in higher m/z range. |
pre.process |
Logical. If true, the program will not perform time correction and alignment. It will only generate peak tables for each spectrum and save the files. It allows manually dividing the task among multiple machines. |
recover.mz.range |
A parameter of the recover.weaker() function. The m/z around the feature m/z to search for observations. The default value is NA, in which case 1.5 times the m/z tolerance in the aligned object will be used. |
recover.chr.range |
A parameter of the recover.weaker() function. The retention time around the feature retention time to search for observations. The default value is NA, in which case 0.5 times the retention time tolerance in the aligned object will be used. |
use.observed.range |
A parameter of the recover.weaker() function. If the value is TRUE, the actual range of the observed locations of the feature in all the spectra will be used. |
match.tol.ppm |
The ppm tolerance to match identified features to known metabolites/features. |
new.feature.min.count |
The number of profiles in which a new feature must be present for it to be added to the database. |
recover.min.count |
The minimum time point count for a series of points in the EIC for it to be considered a true feature. |
intensity.weighted |
Whether to weight the local density by signal intensities. |
Details
The function first conducts an unsupervised feature detection in the new dataset. It then matches the newly identified features to the database. Next, the database features that were not found are merged with the newly found features, and a weak signal recovery is performed. The final feature table is used to update the database.
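The mz.cut and rt.cut arguments describe a grid over the m/z and retention time plane in which the density of each point is calculated. One simple way to picture such a grid density is to count how many points share a cell. The sketch below is an illustration only, not the package's 2D peak detection: the function name grid.density.sketch is hypothetical, and both grid widths are treated as absolute units here, which is an assumption because the text above does not state whether mz.cut is absolute or relative.

```r
# Hypothetical sketch: per-point density as the number of points that fall
# into the same (m/z, RT) grid cell.
grid.density.sketch <- function(mz, rt, mz.cut, rt.cut) {
  # Assign every point to a grid cell by integer division.
  mz.bin <- floor(mz / mz.cut)
  rt.bin <- floor(rt / rt.cut)
  cell <- paste(mz.bin, rt.bin, sep = "_")
  # Count the points per cell and map the counts back to the points.
  counts <- table(cell)
  as.vector(counts[cell])
}

# Three points close together in both dimensions and one isolated point.
mz <- c(200.001, 200.002, 200.003, 350.500)
rt <- c(100, 110, 120, 400)
dens <- grid.density.sketch(mz, rt, mz.cut = 0.01, rt.cut = 50)
```

The three clustered points share one cell and therefore each receive a density of 3, while the isolated point receives a density of 1.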
Value
A list is returned.
features |
A list object, each component of which is the peak table from a single spectrum. |
features2 |
A list object, each component of which is the peak table from a single spectrum, after elution time correction. |
aligned.ftrs |
Feature table BEFORE weak signal recovery. |
final.ftrs |
Feature table after weak signal recovery. This is the end product of the function. |
pk.times |
Table of feature elution time BEFORE weak signal recovery. |
final.times |
Table of feature elution time after weak signal recovery. |
mz.tol |
The input mz.tol value by the user. |
align.mz.tol |
The m/z tolerance level in the alignment across spectra, either input from the user or automatically selected when the user input is NA. |
align.chr.tol |
The retention time tolerance level in the alignment across spectra, either input from the user or automatically selected when the user input is NA. |
updated.known.table |
The known table updated using the newly processed data. It should be used for future datasets generated using the same machine and LC column. |
ftrs.known.table.pairing |
The pairing information between the feature table of the current dataset and the known feature table. |
intensity.weighted |
Whether to weight the local density by signal intensities in the initial peak detection stage. |
Author(s)
Tianwei Yu <tianwei.yu@emory.edu>
See Also
cdf.to.ftrs, proc.cdf, prof.to.feature, adjust.time, feature.align, recover.weaker
Semi-supervised feature detection using machine learning approach.
Description
The semi-supervised procedure utilizes a database of known metabolites and previously detected features to identify features in a new dataset. It is recommended ONLY for experienced users. The user may need to construct the known feature database that strictly follows the format described below.
Usage
semi.sup.learn(folder, output_path, file.pattern=".cdf", known.table=NA, n.nodes=4,
min.exp=2, min.pres=0.3, min.run=4, mz.tol=1e-5, shape.model="bi-Gaussian",
baseline.correct=0, peak.estim.method="moment", min.bw=NA, max.bw=NA, sd.cut=c(0.01,500),
component.eliminate=0.01, moment.power=1, sigma.ratio.lim=c(0.01, 100), subs=NULL,
align.mz.tol=NA, align.chr.tol=NA, max.align.mz.diff=0.01, pre.process=FALSE,
recover.mz.range=NA, recover.chr.range=NA, use.observed.range=TRUE, match.tol.ppm=5,
new.feature.min.count=2, recover.min.count=3, use.learn=TRUE, ridge.smoother.window=50,
smoother.window=c(1, 5, 10), pos.confidence=0.99, neg.confidence=0.99, max.ftrs.to.use=10,
do.grp.reduce=TRUE, remove.bottom.ftrs=0, max.fpr=0.5, min.tpr=0.9,
intensity.weighted=FALSE)
Arguments
folder |
The folder where all CDF files to be processed are located. For example "C:/CDF/this_experiment" |
output_path |
Path to the output directory |
file.pattern |
The pattern in the names of the files to be processed. The default is ".cdf". Other formats supported by mzR package can also be used, e.g. "mzML" etc. |
known.table |
A data frame containing the known metabolite ions and previously found features. It contains 18 columns: "chemical_formula": the chemical formula if known; "HMDB_ID": HMDB ID if known; "KEGG_compound_ID": KEGG compound ID if known; "neutral.mass": the neutral mass if known; "ion.type": the ion form, such as H+, Na+, ..., if known; "m.z": m/z value, either theoretical for known metabolites, or mean observed value for unknown but previously found features; "Number_profiles_processed": the total number of LC/MS profiles that were used to build this database; "Percent_found": the percentage of all profiles processed in building this database in which this feature was found; "mz_min": the minimum m/z value observed for this feature; "mz_max": the maximum m/z value observed for this feature; "RT_mean": the mean retention time observed for this feature; "RT_sd": the standard deviation of retention time observed for this feature; "RT_min": the minimum retention time observed for this feature; "RT_max": the maximum retention time observed for this feature; "int_mean.log.": the mean log intensity observed for this feature; "int_sd.log.": the standard deviation of log intensity observed for this feature; "int_min.log.": the minimum log intensity observed for this feature; "int_max.log.": the maximum log intensity observed for this feature. |
n.nodes |
The number of CPU cores to be used through doSNOW. |
min.exp |
If a feature is to be included in the final feature table, it must be present in at least this number of spectra. |
min.pres |
This is a parameter of the run filter, to be passed to the function proc.cdf(). Please see the help for proc.cdf() for details. |
min.run |
This is a parameter of the run filter, to be passed to the function proc.cdf(). Please see the help for proc.cdf() for details. |
mz.tol |
The user can provide the m/z tolerance level for peak identification. This value is expressed as a fraction of the m/z value (for example, 1e-5 corresponds to 10 ppm). This value, multiplied by the m/z value, becomes the cutoff level. Please see the help for proc.cdf() for details. |
shape.model |
The mathematical model for the shape of a peak. There are two choices - "bi-Gaussian" and "Gaussian". When the peaks are asymmetric, the bi-Gaussian is better. The default is "bi-Gaussian". |
baseline.correct |
This is a parameter in peak detection. After grouping the observations, the highest observation in each group is found. If the highest is lower than this value, the entire group will be deleted. If the value is NA, the program searches for the cutoff level. Note that the default shown in the usage above is 0, not NA. Please see the help for proc.cdf() for details. |
peak.estim.method |
The bi-Gaussian peak parameter estimation method, to be passed to the subroutine prof.to.features(). Two possible values: moment and EM. |
min.bw |
The minimum bandwidth in the smoother in prof.to.features(). Please see the help file for prof.to.features() for details. |
max.bw |
The maximum bandwidth in the smoother in prof.to.features(). Please see the help file for prof.to.features() for details. |
sd.cut |
A parameter for the prof.to.features() function. A vector of two. Features with standard deviation outside the range defined by the two numbers are eliminated. |
sigma.ratio.lim |
A parameter for the prof.to.features() function. A vector of two. It enforces the belief of the range of the ratio between the left-standard deviation and the right-standard deviation of the bi-Gaussian function used to fit the data. |
subs |
If not all the CDF files in the folder are to be processed, the user can define a subset using this parameter. For example, subs=15:30, or subs=c(2,4,6,8) |
component.eliminate |
When fitting a mixture of bi-Gaussian (or Gaussian) models to an EIC, a component that accounts for a proportion of intensities less than this value is ignored. |
moment.power |
The power parameter for data transformation when fitting the bi-Gaussian or Gaussian mixture model in an EIC. |
align.chr.tol |
The user can provide the elution time tolerance level to override the program's selection. This value is in the same unit as the elution time, normally seconds. Please see the help for match.time() for details. |
align.mz.tol |
The user can provide the m/z tolerance level for peak alignment to override the program's selection. This value is expressed as a fraction of the m/z value (for example, 1e-5 corresponds to 10 ppm). This value, multiplied by the m/z value, becomes the cutoff level. Please see the help for feature.align() for details. |
max.align.mz.diff |
As the m/z tolerance in alignment is expressed in relative terms (ppm), it may not be suitable when the m/z range is wide. This parameter limits the tolerance in absolute terms. It mostly influences feature matching in higher m/z range. |
pre.process |
Logical. If true, the program will not perform time correction and alignment. It will only generate peak tables for each spectrum and save the files. It allows manually dividing the task among multiple machines. |
recover.mz.range |
A parameter of the recover.weaker() function. The m/z around the feature m/z to search for observations. The default value is NA, in which case 1.5 times the m/z tolerance in the aligned object will be used. |
recover.chr.range |
A parameter of the recover.weaker() function. The retention time around the feature retention time to search for observations. The default value is NA, in which case 0.5 times the retention time tolerance in the aligned object will be used. |
use.observed.range |
A parameter of the recover.weaker() function. If the value is TRUE, the actual range of the observed locations of the feature in all the spectra will be used. |
match.tol.ppm |
The ppm tolerance to match identified features to known metabolites/features. |
new.feature.min.count |
The number of profiles in which a new feature must be present for it to be added to the database. |
recover.min.count |
The minimum time point count for a series of points in the EIC for it to be considered a true feature. |
use.learn |
Whether to use the machine learning approach. The default is TRUE. |
ridge.smoother.window |
The size of the smoother window used by the kernel smoother to remove long ridge noise from each EIC. |
smoother.window |
The smoother windows to use in data feature generation. |
pos.confidence |
The confidence level for the features matched to the known feature list. |
neg.confidence |
The confidence level for the features not matching to the known feature list. |
max.ftrs.to.use |
The maximum number of data features to use in a predictive model. |
do.grp.reduce |
Whether to reduce data features that are similar. It is based on data feature predictability. |
remove.bottom.ftrs |
The number of worst performing data features to remove before model building. |
max.fpr |
The proportion of unmatched features to be selected in the feature detection step. |
min.tpr |
The proportion of matched features to be selected in the feature detection step. |
intensity.weighted |
Whether to weight the local density by signal intensities in the initial peak detection stage. |
Details
The function first conducts a machine-learning feature detection in the new dataset. It then conducts the regular feature alignment, retention time adjustment, and weak signal recovery.
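The align.mz.tol and max.align.mz.diff arguments described above combine a relative m/z tolerance with an absolute cap. One plausible reading of that combination is sketched below; the function name mz.window is hypothetical, and taking the smaller of the two values is an assumption about how the limits interact, not a statement of the package's internal logic.

```r
# Hypothetical sketch: alignment window in absolute m/z units.
mz.window <- function(mz, align.mz.tol, max.align.mz.diff) {
  # The relative tolerance grows with m/z; the absolute cap bounds it.
  pmin(mz * align.mz.tol, max.align.mz.diff)
}

# Windows at low, medium, and high m/z for a 10 ppm relative tolerance.
w <- mz.window(c(100, 500, 1500), align.mz.tol = 1e-5,
               max.align.mz.diff = 0.01)

# Two features are candidates for matching when their m/z values differ
# by no more than the window at that m/z.
close.enough <- abs(1500.000 - 1500.012) <=
  mz.window(1500, 1e-5, 0.01)
```

At m/z 100 and 500 the relative tolerance applies (about 0.001 and 0.005), while at m/z 1500 the uncapped value of 0.015 is limited to 0.01, which is why the cap mostly influences matching in the higher m/z range.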
Value
A list is returned.
features |
A list object, each component of which is the peak table from a single spectrum. |
features2 |
A list object, each component of which is the peak table from a single spectrum, after elution time correction. |
aligned.ftrs |
Feature table BEFORE weak signal recovery. |
final.ftrs |
Feature table after weak signal recovery. This is the end product of the function. |
pk.times |
Table of feature elution time BEFORE weak signal recovery. |
final.times |
Table of feature elution time after weak signal recovery. |
mz.tol |
The input mz.tol value by the user. |
align.mz.tol |
The m/z tolerance level in the alignment across spectra, either input from the user or automatically selected when the user input is NA. |
align.chr.tol |
The retention time tolerance level in the alignment across spectra, either input from the user or automatically selected when the user input is NA. |
updated.known.table |
The known table updated using the newly processed data. It should be used for future datasets generated using the same machine and LC column. |
ftrs.known.table.pairing |
The pairing information between the feature table of the current dataset and the known feature table. |
Author(s)
Tianwei Yu <tianwei.yu@emory.edu>
See Also
cdf.to.ftrs, semi.sup, learn.cdf, prof.to.feature, adjust.time, feature.align, recover.weaker
Targeted search of metabolites with given m/z and (optional) retention time
Description
The function conducts targeted search only. The search is based on m/z and (optionally) retention time. If there is a sufficient number of peaks (>=100) in each profile, the function will conduct retention time correction and peak alignment in order to reduce potential redundancies.
Usage
target.search(folder, output_path, file.pattern = ".cdf", known.table = NA, n.nodes = 4,
min.exp = 2, min.bw = NA, max.bw = NA, subs = NULL, align.mz.tol = 2e-05,
align.chr.tol = 150, max.align.mz.diff = 0.01, recover.mz.range = NA,
recover.chr.range = NA, use.observed.range = TRUE, match.tol.ppm = 5,
new.feature.min.count = 2, recover.min.count = 3)
Arguments
folder |
The folder where all CDF files to be processed are located. For example "C:/CDF/this_experiment" |
output_path |
Path to the output directory |
file.pattern |
The pattern in the names of the files to be processed. The default is ".cdf". Other formats supported by mzR package can also be used, e.g. "mzML" etc. |
known.table |
A data frame containing the known metabolite ions and previously found features. It contains 18 columns: "chemical_formula": the chemical formula if known; "HMDB_ID": HMDB ID if known; "KEGG_compound_ID": KEGG compound ID if known; "neutral.mass": the neutral mass if known; "ion.type": the ion form, such as H+, Na+, ..., if known; "m.z": m/z value, either theoretical for known metabolites, or mean observed value for unknown but previously found features; "Number_profiles_processed": the total number of LC/MS profiles that were used to build this database; "Percent_found": the percentage of all profiles processed in building this database in which this feature was found; "mz_min": the minimum m/z value observed for this feature; "mz_max": the maximum m/z value observed for this feature; "RT_mean": the mean retention time observed for this feature; "RT_sd": the standard deviation of retention time observed for this feature; "RT_min": the minimum retention time observed for this feature; "RT_max": the maximum retention time observed for this feature; "int_mean.log.": the mean log intensity observed for this feature; "int_sd.log.": the standard deviation of log intensity observed for this feature; "int_min.log.": the minimum log intensity observed for this feature; "int_max.log.": the maximum log intensity observed for this feature. |
n.nodes |
The number of CPU cores to be used through doSNOW. |
min.exp |
If a feature is to be included in the final feature table, it must be present in at least this number of spectra. |
min.bw |
The minimum bandwidth in the smoother in prof.to.features(). Please see the help file for prof.to.features() for details. |
max.bw |
The maximum bandwidth in the smoother in prof.to.features(). Please see the help file for prof.to.features() for details. |
subs |
If not all the CDF files in the folder are to be processed, the user can define a subset using this parameter. For example, subs=15:30, or subs=c(2,4,6,8) |
align.chr.tol |
The user can provide the elution time tolerance level to override the program's selection. This value is in the same unit as the elution time, normally seconds. Please see the help for match.time() for details. |
align.mz.tol |
The user can provide the m/z tolerance level for peak alignment to override the program's selection. This value is expressed as a fraction of the m/z value (for example, 2e-5 corresponds to 20 ppm). This value, multiplied by the m/z value, becomes the cutoff level. Please see the help for feature.align() for details. |
max.align.mz.diff |
As the m/z tolerance in alignment is expressed in relative terms (ppm), it may not be suitable when the m/z range is wide. This parameter limits the tolerance in absolute terms. It mostly influences feature matching in higher m/z range. |
recover.mz.range |
A parameter of the recover.weaker() function. The m/z around the feature m/z to search for observations. The default value is NA, in which case 1.5 times the m/z tolerance in the aligned object will be used. |
recover.chr.range |
A parameter of the recover.weaker() function. The retention time around the feature retention time to search for observations. The default value is NA, in which case 0.5 times the retention time tolerance in the aligned object will be used. |
use.observed.range |
A parameter of the recover.weaker() function. If the value is TRUE, the actual range of the observed locations of the feature in all the spectra will be used. |
match.tol.ppm |
The ppm tolerance to match identified features to known metabolites/features. |
new.feature.min.count |
The number of profiles in which a new feature must be present for it to be added to the database. |
recover.min.count |
The minimum number of time points that a series of points in the EIC must contain for it to be considered a true feature. |
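As an illustration of the known.table layout described above, the following sketch builds an empty data frame with the 18 column names listed. It is a minimal sketch only: the object names known.cols, char.cols and known.skeleton are hypothetical, and the column types (character for formula, IDs and ion type; numeric otherwise) are assumptions rather than something the package documents.

```r
# Column names as listed in the description of known.table.
known.cols <- c(
  "chemical_formula", "HMDB_ID", "KEGG_compound_ID", "neutral.mass",
  "ion.type", "m.z", "Number_profiles_processed", "Percent_found",
  "mz_min", "mz_max", "RT_mean", "RT_sd", "RT_min", "RT_max",
  "int_mean.log.", "int_sd.log.", "int_min.log.", "int_max.log."
)

# Assumption: these columns hold text; all other columns are treated as numeric.
char.cols <- c("chemical_formula", "HMDB_ID", "KEGG_compound_ID", "ion.type")

# Build a zero-row data frame with one column per name.
cols <- lapply(known.cols, function(cn) {
  if (cn %in% char.cols) character(0) else numeric(0)
})
known.skeleton <- as.data.frame(setNames(cols, known.cols),
                                stringsAsFactors = FALSE)

# Show the resulting structure (0 rows, 18 columns).
str(known.skeleton)
```

Rows for known metabolites or previously found features could then be appended to such a skeleton before it is passed as known.table.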
Value
features |
A list object, each component of which is the peak table from a single spectrum. |
filled.ftrs |
The target features are filled one by one. Notice this table may contain duplicates if some target features are too close. |
reduced.ftrs |
If the number of target features is large enough (>=100 detected in each profile), retention time correction and peak alignment are conducted to generate this feature table without redundancy. |
filled.times |
The target features are filled one by one. This is the retention time table. Notice this table may contain duplicates if some target features are too close. |
reduced.times |
If the number of target features is large enough (>=100 detected in each profile), retention time correction and peak alignment are conducted to generate this feature table without redundancy. This is the retention time table of the aligned features. |
Author(s)
Tianwei Yu <tianwei.yu@emory.edu>
See Also
cdf.to.ftr, proc.cdf, prof.to.features, adjust.time, feature.align, recover.weaker
Two step hybrid feature detection.
Description
A two-stage hybrid feature detection and alignment procedure, for data generated in multiple batches.
Usage
two.step.hybrid(folder, info, min.within.batch.prop.detect=0.4,
min.within.batch.prop.report=0.5, min.batch.prop=0.5, batch.align.mz.tol=1e-5,
batch.align.chr.tol=50, file.pattern=".cdf", known.table=NA, n.nodes=4,
min.pres=0.5, min.run=12, mz.tol=1e-5, baseline.correct.noise.percentile=0.05,
shape.model="bi-Gaussian",baseline.correct=0, peak.estim.method="moment", min.bw=NA,
max.bw=NA, sd.cut=c(0.1, 100), sigma.ratio.lim=c(0.05, 20), component.eliminate=0.01,
moment.power=2, align.mz.tol=NA, align.chr.tol=NA, max.align.mz.diff=0.01,
pre.process=FALSE, recover.mz.range=NA, recover.chr.range=NA, use.observed.range=TRUE,
match.tol.ppm=NA, new.feature.min.count=2, recover.min.count=3)
Arguments
folder |
The folder where all CDF files to be processed are located. For example "C:/CDF/this_experiment" |
info |
A table with two columns. The first column is the file names, and the second column is the batch label of each file. |
min.within.batch.prop.detect |
A feature needs to be present in at least this proportion of the files, for it to be initially detected as a feature for a batch. This parameter replaces the "min.exp" parameter in semi.sup(). |
min.within.batch.prop.report |
A feature needs to be present in at least this proportion of the files, in a proportion of batches controlled by "min.batch.prop", to be included in the final feature table. This parameter replaces the "min.exp" parameter in semi.sup(). |
min.batch.prop |
A feature needs to be present in at least this proportion of the batches, for it to be considered in the entire data. |
batch.align.mz.tol |
The m/z tolerance in ppm for between-batch alignment. |
batch.align.chr.tol |
The RT tolerance for between-batch alignment. |
file.pattern |
The pattern in the names of the files to be processed. The default is ".cdf". Other formats supported by mzR package can also be used, e.g. "mzML" etc. |
known.table |
A data frame containing the known metabolite ions and previously found features. It contains 18 columns: "chemical_formula": the chemical formula if known; "HMDB_ID": HMDB ID if known; "KEGG_compound_ID": KEGG compound ID if known; "neutral.mass": the neutral mass if known; "ion.type": the ion form, such as H+, Na+, ..., if known; "m.z": m/z value, either theoretical for known metabolites, or mean observed value for unknown but previously found features; "Number_profiles_processed": the total number of LC/MS profiles that were used to build this database; "Percent_found": the percentage of all profiles processed in building this database in which this feature was found; "mz_min": the minimum m/z value observed for this feature; "mz_max": the maximum m/z value observed for this feature; "RT_mean": the mean retention time observed for this feature; "RT_sd": the standard deviation of retention time observed for this feature; "RT_min": the minimum retention time observed for this feature; "RT_max": the maximum retention time observed for this feature; "int_mean.log.": the mean log intensity observed for this feature; "int_sd.log.": the standard deviation of log intensity observed for this feature; "int_min.log.": the minimum log intensity observed for this feature; "int_max.log.": the maximum log intensity observed for this feature. |
n.nodes |
The number of CPU cores to be used through doSNOW. |
min.pres |
This is a parameter of the run filter, to be passed to the function proc.cdf(). Please see the help for proc.cdf() for details. |
min.run |
This is a parameter of the run filter, to be passed to the function proc.cdf(). Please see the help for proc.cdf() for details. |
mz.tol |
The user can provide the m/z tolerance level for peak identification. This value is expressed as a fraction of the m/z value: multiplied by the m/z value, it becomes the cutoff level. Please see the help for proc.cdf() for details. |
baseline.correct.noise.percentile |
The percentile of signal strength of those EICs that do not pass the run filter, to be used as the baseline threshold of signal strength. This parameter is passed to proc.cdf(). |
shape.model |
The mathematical model for the shape of a peak. There are two choices - "bi-Gaussian" and "Gaussian". When the peaks are asymmetric, the bi-Gaussian is better. The default is "bi-Gaussian". |
baseline.correct |
This is a parameter in peak detection. After grouping the observations, the highest observation in each group is found. If the highest is lower than this value, the entire group will be deleted. If the value is NA, the program searches for the cutoff level. The default in this function is 0. Please see the help for proc.cdf() for details. |
peak.estim.method |
The bi-Gaussian peak parameter estimation method, to be passed to the subroutine prof.to.features(). Two possible values: "moment" and "EM". |
min.bw |
The minimum bandwidth in the smoother in prof.to.features(). Please see the help file for prof.to.features() for details. |
max.bw |
The maximum bandwidth in the smoother in prof.to.features(). Please see the help file for prof.to.features() for details. |
sd.cut |
A parameter for the prof.to.features() function. A vector of two. Features with standard deviation outside the range defined by the two numbers are eliminated. |
sigma.ratio.lim |
A parameter for the prof.to.features() function. A vector of two. It constrains the ratio between the left standard deviation and the right standard deviation of the bi-Gaussian function used to fit the data to lie within this range. |
component.eliminate |
In fitting mixture of bi-Gaussian (or Gaussian) model of an EIC, when a component accounts for a proportion of intensities less than this value, the component will be ignored. |
moment.power |
The power parameter for data transformation when fitting the bi-Gaussian or Gaussian mixture model in an EIC. |
align.chr.tol |
The user can provide the elution time tolerance level to override the program's selection. This value is in the same unit as the elution time, normally seconds. Please see the help for match.time() for details. |
align.mz.tol |
The user can provide the m/z tolerance level for peak alignment to override the program's selection. This value is expressed as a fraction of the m/z value: multiplied by the m/z value, it becomes the cutoff level. Please see the help for feature.align() for details. |
max.align.mz.diff |
As the m/z tolerance in alignment is expressed in relative terms (ppm), it may not be suitable when the m/z range is wide. This parameter limits the tolerance in absolute terms. It mostly influences feature matching in higher m/z range. |
pre.process |
Logical. If TRUE, the program will not perform time correction and alignment. It will only generate the peak table for each spectrum and save the files. This allows the task to be divided manually among multiple machines. |
recover.mz.range |
A parameter of the recover.weaker() function. The m/z around the feature m/z to search for observations. The default value is NA, in which case 1.5 times the m/z tolerance in the aligned object will be used. |
recover.chr.range |
A parameter of the recover.weaker() function. The retention time around the feature retention time to search for observations. The default value is NA, in which case 0.5 times the retention time tolerance in the aligned object will be used. |
use.observed.range |
A parameter of the recover.weaker() function. If the value is TRUE, the actual range of the observed locations of the feature in all the spectra will be used. |
match.tol.ppm |
The ppm tolerance to match identified features to known metabolites/features. |
new.feature.min.count |
The number of profiles in which a new feature must be present for it to be added to the database. |
recover.min.count |
The minimum number of time points that a series of points in the EIC must contain for it to be considered a true feature. |
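The interplay between a relative m/z tolerance (such as align.mz.tol) and the absolute cap max.align.mz.diff can be illustrated with a little arithmetic. The sketch below is not the package's code: it assumes the effective tolerance is simply the smaller of the two quantities, the helper name effective.mz.tol is hypothetical, and the relative tolerance of 1e-5 is an illustrative value (the documented default of align.mz.tol is NA, in which case the program chooses the tolerance).

```r
# Hypothetical helper, not part of apLCMS:
# a relative tolerance (fraction of m/z) capped at an absolute maximum.
effective.mz.tol <- function(mz, rel.tol = 1e-5, max.diff = 0.01) {
  pmin(rel.tol * mz, max.diff)
}

# Illustrative m/z values spanning a wide range.
mz <- c(100, 500, 1000, 2000)

# Tolerance in absolute m/z units at each m/z value.
tol <- effective.mz.tol(mz)
print(data.frame(mz = mz, tol = tol))
```

Under these assumptions the cap only takes effect above m/z 1000, which is consistent with the statement that max.align.mz.diff mostly influences feature matching in the higher m/z range.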
Details
The function first conducts hybrid feature detection and alignment in each batch separately. Then a between-batch RT correction and feature alignment is conducted. Weak signal recovery is conducted at the single feature table level.
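The two-level presence rule controlled by min.within.batch.prop.report and min.batch.prop can be sketched in base R. This is an illustration under stated assumptions, not the package's implementation: the file names, batch labels and the detection vector are made up, and the rule is read as "the proportion of batches in which the feature reaches the within-batch proportion must be at least min.batch.prop".

```r
# Hypothetical sample information table in the two-column layout of 'info':
# first column file names, second column batch labels.
info <- data.frame(
  file = sprintf("sample_%02d.cdf", 1:8),
  batch = rep(c("A", "B"), each = 4),
  stringsAsFactors = FALSE
)

# Made-up detection indicator for a single feature (TRUE = found in that file).
detected <- c(TRUE, TRUE, TRUE, FALSE,     # batch A: 3 of 4 files
              TRUE, FALSE, FALSE, FALSE)   # batch B: 1 of 4 files

min.within.batch.prop.report <- 0.5
min.batch.prop <- 0.5

# Proportion of files in each batch in which the feature was detected.
prop.by.batch <- tapply(detected, info$batch, mean)

# Batches in which the feature reaches the within-batch threshold.
batch.pass <- prop.by.batch >= min.within.batch.prop.report

# Keep the feature if enough batches pass.
keep.feature <- mean(batch.pass) >= min.batch.prop
print(prop.by.batch)
print(keep.feature)
```

In this example the feature passes in one of the two batches, which meets the 0.5 batch proportion, so it would be kept under the assumed rule.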
Value
A list is returned.
batchwise.results |
A list. Each item in the list is the product of semi.sup() from a single batch. |
final.ftrs |
Feature table. This is the end product of the function. |
Author(s)
Tianwei Yu <tianwei.yu@emory.edu>
See Also
semi.sup, cdf.to.ftr, proc.cdf, prof.to.features, adjust.time, feature.align, recover.weaker
Two step hybrid feature detection using 2D peak detection.
Description
A two-stage hybrid feature detection and alignment procedure, for data generated in multiple batches.
Usage
two.step.hybrid.2d(folder, info, min.within.batch.prop.detect=0.4,
min.within.batch.prop.report=0.5, min.batch.prop=0.5, batch.align.mz.tol=1e-5,
batch.align.chr.tol=50, file.pattern=".cdf", known.table=NA, n.nodes=4, mz.cut = 1e-4,
rt.cut = 50, mz.search.range = 5e-4, rt.search.range = 200,
intensity.limit.quantile = 0.05, mPower=4, mz.tol=1e-5, align.mz.tol=NA, align.chr.tol=NA,
max.align.mz.diff=0.01, pre.process=FALSE, recover.mz.range=NA, recover.chr.range=NA,
use.observed.range=TRUE, match.tol.ppm=NA, new.feature.min.count=2, recover.min.count=3)
Arguments
folder |
The folder where all CDF files to be processed are located. For example "C:/CDF/this_experiment" |
info |
A table with two columns. The first column is the file names, and the second column is the batch label of each file. |
min.within.batch.prop.detect |
A feature needs to be present in at least this proportion of the files, for it to be initially detected as a feature for a batch. This parameter replaces the "min.exp" parameter in semi.sup(). |
min.within.batch.prop.report |
A feature needs to be present in at least this proportion of the files, in a proportion of batches controlled by "min.batch.prop", to be included in the final feature table. This parameter replaces the "min.exp" parameter in semi.sup(). |
min.batch.prop |
A feature needs to be present in at least this proportion of the batches, for it to be considered in the entire data. |
batch.align.mz.tol |
The m/z tolerance in ppm for between-batch alignment. |
batch.align.chr.tol |
The RT tolerance for between-batch alignment. |
file.pattern |
The pattern in the names of the files to be processed. The default is ".cdf". Other formats supported by mzR package can also be used, e.g. "mzML" etc. |
known.table |
A data frame containing the known metabolite ions and previously found features. It contains 18 columns: "chemical_formula": the chemical formula if known; "HMDB_ID": HMDB ID if known; "KEGG_compound_ID": KEGG compound ID if known; "neutral.mass": the neutral mass if known; "ion.type": the ion form, such as H+, Na+, ..., if known; "m.z": m/z value, either theoretical for known metabolites, or mean observed value for unknown but previously found features; "Number_profiles_processed": the total number of LC/MS profiles that were used to build this database; "Percent_found": the percentage of all profiles processed in building this database in which this feature was found; "mz_min": the minimum m/z value observed for this feature; "mz_max": the maximum m/z value observed for this feature; "RT_mean": the mean retention time observed for this feature; "RT_sd": the standard deviation of retention time observed for this feature; "RT_min": the minimum retention time observed for this feature; "RT_max": the maximum retention time observed for this feature; "int_mean.log.": the mean log intensity observed for this feature; "int_sd.log.": the standard deviation of log intensity observed for this feature; "int_min.log.": the minimum log intensity observed for this feature; "int_max.log.": the maximum log intensity observed for this feature. |
n.nodes |
The number of CPU cores to be used through doSNOW. |
mz.cut |
The grid width in m/z used when calculating the density of each point. |
rt.cut |
The grid width in RT used when calculating the density of each point. |
mz.search.range |
The maximum peak width in m/z. |
rt.search.range |
The maximum peak width in RT. |
intensity.limit.quantile |
The intensity threshold, expressed as a quantile. |
mPower |
The power parameter for data transformation when fitting the bi-Gaussian or Gaussian mixture model in an EIC. |
mz.tol |
The user can provide the m/z tolerance level for peak identification. This value is expressed as a fraction of the m/z value: multiplied by the m/z value, it becomes the cutoff level. Please see the help for proc.cdf() for details. |
align.mz.tol |
The user can provide the m/z tolerance level for peak alignment to override the program's selection. This value is expressed as a fraction of the m/z value: multiplied by the m/z value, it becomes the cutoff level. Please see the help for feature.align() for details. |
align.chr.tol |
The user can provide the elution time tolerance level to override the program's selection. This value is in the same unit as the elution time, normally seconds. Please see the help for match.time() for details. |
max.align.mz.diff |
As the m/z tolerance in alignment is expressed in relative terms (ppm), it may not be suitable when the m/z range is wide. This parameter limits the tolerance in absolute terms. It mostly influences feature matching in higher m/z range. |
pre.process |
Logical. If TRUE, the program will not perform time correction and alignment. It will only generate the peak table for each spectrum and save the files. This allows the task to be divided manually among multiple machines. |
recover.mz.range |
A parameter of the recover.weaker() function. The m/z around the feature m/z to search for observations. The default value is NA, in which case 1.5 times the m/z tolerance in the aligned object will be used. |
recover.chr.range |
A parameter of the recover.weaker() function. The retention time around the feature retention time to search for observations. The default value is NA, in which case 0.5 times the retention time tolerance in the aligned object will be used. |
use.observed.range |
A parameter of the recover.weaker() function. If the value is TRUE, the actual range of the observed locations of the feature in all the spectra will be used. |
match.tol.ppm |
The ppm tolerance to match identified features to known metabolites/features. |
new.feature.min.count |
The number of profiles in which a new feature must be present for it to be added to the database. |
recover.min.count |
The minimum number of time points that a series of points in the EIC must contain for it to be considered a true feature. |
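To make the notion of a grid width concrete, the sketch below bins a handful of made-up (m/z, RT) points into cells of size mz.cut by rt.cut and counts how many points share each cell. It is only an illustration of gridding: the data are invented, the variable names are hypothetical, and the package's actual density calculation may differ.

```r
# Made-up data points; not real LC/MS measurements.
mz <- c(200.00001, 200.00004, 200.00012, 200.00035)
rt <- c(10, 30, 60, 120)

# Grid widths, using the default values shown in the usage above.
mz.cut <- 1e-4
rt.cut <- 50

# Index of the grid cell along each axis, counted from the smallest value.
mz.bin <- floor((mz - min(mz)) / mz.cut)
rt.bin <- floor((rt - min(rt)) / rt.cut)

# Number of points falling in the same (m/z, RT) cell as each point.
counts <- ave(rep(1, length(mz)), mz.bin, rt.bin, FUN = sum)
print(data.frame(mz = mz, rt = rt, mz.bin = mz.bin, rt.bin = rt.bin,
                 count = counts))
```

Here the first two points share a cell, while the other two each occupy a cell of their own; smaller values of mz.cut or rt.cut would give a finer grid and lower counts per cell.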
Details
The function first conducts hybrid feature detection and alignment in each batch separately. Then a between-batch RT correction and feature alignment is conducted. Weak signal recovery is conducted at the single feature table level.
Value
A list is returned.
batchwise.results |
A list. Each item in the list is the product of semi.sup() from a single batch. |
final.ftrs |
Feature table. This is the end product of the function. |
Author(s)
Tianwei Yu <tianwei.yu@emory.edu>
See Also
semi.sup, cdf.to.ftr, proc.cdf, prof.to.features, adjust.time, feature.align, recover.weaker