[BioC] Question on PROcess package

Fri Dec 11 16:40:14 CET 2009

Hi Farida,
The key here is to choose the correct values for the lowcut and/or highcut options. Most MS-TOF experts will tell you that the counts (data) arriving at the detector for the first several hundred Daltons (even 1k-2k Da, depending on the technology, machine settings, and who you ask) is basically trash, and shouldn't be used. So it is during the very first few milliseconds (e.g. hundred Daltons) where most spectra don't agree, so you use lowcut to start the your data at the lowest Dalton of interest, or the lowest Dalton value that is common to all your spectra.

My function doesn't find these values (lowcut and highcut) for you automatically, because the investigator should know the Dalton range of interest that the 
MS analyzer was set up to detect. If you don't know that, I'd recommend 700 Da or 1000 Da for low laser-intensity experiments, which I'm assuming you are doing for simplicity's sake of this email.

I will say that if your starting Da (or ending Da) of ALL your spectra differ by more than 1-3 Da, something is wrong with the experiment, or data is being mixed together that shouldn't be. If that is the case, email me directly and I will try to help you.

Best of luck,
Wade

________________________________________
From: Farida Mostajabi [f0most01 at louisville.edu]
Sent: Thursday, December 10, 2009 2:36 PM
To: Davis, Wade; bioconductor at stat.math.ethz.ch
Subject: RE: [BioC] Question on PROcess package

Hi Wade,

Thank you for the code.My question on the code is:

On the part, you create empty matrix, the matrix dimension is chosen  based on the dimension of the first spectrum.

        ftemp <- read.files(file.path(fldr,paste(SpecNames), fsep ="\\")[1])
        ftemp2 <- ftemp[ftemp[, 1] > lowcut & ftemp[, 1] < highcut, ]
        bseoffM<-matrix(data=0.0123456,ncol=n,nrow=dim(ftemp2)[1])

 what if the dimention of other spectrum are different, which is the case for our problem? On the next part of the program, when it fills  the matrix elements with baseline corrected values, I receive this error

"number of items to replace is not a multiple of replacement length"

How did you approach this issue?

Thanks,

Farida

Hi Farida,
I used to use the PROcess package extensively, but I haven't much for the past 2 years. I ran into the same problem that you did, so I wrote a modified version of the rmBaseline function that fixes that, and does some other things that you may find handy later on. The parts that should interest you most are the highcut and lowcut options.

The normed and rawout options are not used that often.

I last ran this code under R 2.6.1, so I am not sure if it will work without a few tweaks.

Good luck,
Wade

 rmBaseline2<-function(fldr, bseoffrda = NULL, breaks = 200, qntl = 0, method = "loess",
                        lowcut=0, highcut=195000,   bw = 0.1, rawout=FALSE,normed=FALSE,
                        SpecNames = list.files(fldr, pattern = "\\.*csv\\.*"))
{

##################################################
## modified BATCH function for baseline subtraction
##################################################

 # Modified version of rmBaseline function in PROcess package.
 # This version allows you to specify the mass range to consider
 # for baseline removal via the inputs lowcut and highcut.
 # This was written to accounts for minor differences in the spectra length
 # due to the laser firing for slightly different lengths of time.
 #
 # Use rawout=TRUE if you want all of the spectra read in and
 # stored in a matrix without actually baseline subtracting.
 # (This is useful for taking advantage of plotting routines
 # that were originally written for spectra after they had been baseline subtracted.)
 #
 # The use of normed=T is more rare. It was written as part of an exploratory analysis
 # I did to see if it made a difference if you normalized, then baseline subtracted
 # rather than the traditional process of baseline subtracting and then normalizing
 # My analysis showed that it made no appreciable difference, so I am sticking
 # with the status quo.

 SpecNames.abbrev<-unlist(strsplit(SpecNames,split = " [0-9]{3} "))[seq(2,2*length(SpecNames),2)]

  if(normed==FALSE){
        fs <- SpecNames
        n <- length(fs)
        #peek at dimensions to create empty matrix
        ftemp <- read.files(file.path(fldr,paste(SpecNames), fsep ="\\")[1])
        ftemp2 <- ftemp[ftemp[, 1] > lowcut & ftemp[, 1] < highcut, ]
        bseoffM<-matrix(data=0.0123456,ncol=n,nrow=dim(ftemp2)[1])

            for (j in 1:n) {
                f1 <- read.files(file.path(fldr,paste(SpecNames), fsep ="\\")[j])
                fcut <- f1[f1[, 1] > lowcut & f1[, 1] < highcut, ]
                    if(rawout==FALSE){bseoffM[,j] <- bslnoff(fcut, breaks = breaks, qntl = qntl,
                    method = method, bw = bw)[,2]                                        }
                    if(rawout==TRUE){bseoffM[,j]<-fcut[,2]}
                    if (j==1){rownames(bseoffM) <- signif(bslnoff(fcut, breaks = breaks, qntl = qntl,                           method = method, bw = bw)[,1],6)
                    }
                            }

        colnames(bseoffM) <- SpecNames
                    }

    if(normed==TRUE){
        fs <- fldr
        n <- ncol(fs)
            for (j in 1:n) {
                f1 <- cbind(as.numeric(rownames(fs)),fs[,j])
                fcut <- f1[f1[, 1] > lowcut & f1[, 1] < highcut, ]
                    bseoff <- bslnoff(fcut, breaks = breaks, qntl = qntl,
                    method = method, bw = bw)
                    if (j > 1)
                        bseoffM <- cbind(bseoffM, bseoff[, 2])
                    else bseoffM <- bseoff[, 2]
                            }
        dimnames(bseoffM) <- list(signif(bseoff[, 1], 6), SpecNames=colnames(fldr))
                    }

        if (!is.null(bseoffrda))
            save(list = bseoffM, file = bseoffrda)
        bseoffM

}

 ##EXAMPLE
 #   rmBaseline2(fldr=seldipath(basedir="W:\\Master6\\Raw Specta",chiptype="IMAC",inten="high")
 #   ,breaks = 2
 #   ,qntl = 0
 #   ,method = "approx"
 #   ,bw = 0.1,
 #   highcut=50000
 #   )

-----Original Message-----
From: Farida Mostajabi [mailto:f0most01 at louisville.edu]
Sent: Monday, November 30, 2009 2:48 PM
To: bioconductor at stat.math.ethz.ch
Subject: [BioC] Question on PROcess package

To whom it may concern,

I am a student from University of Louisville, USA. I am currently doing some MALDI-TOF  MS data analysis research
with PROcess package.

I am trying to use the batch functionality of the package to do pre processing  on 286 spectra. The m/z values
are not exactly the same throughout the spectra,   which I think it is an assumption in PROcess package.

I used the code below to do  baseline correction for one spectrum  at a time

B.fs <- list.files(my.B.files, pattern = "\\.*csv\\.*", full.names = TRUE)
nb.file <- length(B.fs)

foo<-lapply(seq(nb.file), function(i) read.files(B.fs[i] ))
f0<-lapply(seq(nb.file), function(i) foo[[i]][foo[[i]][,1]>0,])
basecorr<-lapply(seq(nb.file), function(i)  bslnoff(f0[[i]], method = "loess", bw = 0.1))

I could not use "rmBaseline" function since the row-names of the returning matrix are  the m/z values,  which in my case, are not  identical.

Would you please give some suggestions on this issue?

Best Regards,

Farida