[BioC] Package plgem for analysis of spectral counts, was 'limma for spectral counts'

Mon Oct 25 00:11:24 CEST 2010

Hi Yolande,

>From the warning messages and the diagnostic plots that you got I have to agree with you that something strange was introduced into your data by the missing data imputation method. I personally never used (and never felt the need of using) any missing data imputation algorithm for proteomics data. Unlike microarray data, in MS-based proteomics certain proteins are sometimes identified while other times they are not. It's just part of the technology. Having many replicates (as you do!) is what helps increasing the chances of identifying also the least abundant proteins, which are typically identified in only a fraction of the replicate runs (see my Mol Cell Proteomics paper).

In my humble opinion, all one needs to do is replacing the missing values with zeros. By doing so, I was always able to fit a PLGEM to any NSAF dataset I have analyzed so far.

I strongly suggest you to re-try your analysis using your original normalized spectral counts, without missing data imputation, but simply replacing all missing data with zeros. The missing data imputation engine must rely on some assumptions, that are evidently not met in your dataset.

Hope this helps! Let me know how it works.

Cheers,
Norman

P.S. I also noticed that in your last R code you changed the value of argument 'rank' to 5000. This will not have any effect on your calculations, as argument 'rank' is only used when there's not enough replicates available to perform the resampling step (and this is clearly not your case!). You probably wanted to change argument 'Iterations' as per my previous suggestion! :-)

-----Original Message-----
From: Yolande Tra [mailto:yolande.tra at gmail.com] 
Sent: Sunday, October 24, 2010 3:36 PM
To: Pavelka, Norman
Cc: bioconductor at stat.math.ethz.ch
Subject: Re: Package plgem for analysis of spectral counts, was 'limma for spectral counts'

Hi Norman,

Thank you for your eagerness to help.
I rerun the plgem wrapper and obtained the attached plot.

 > LPSdegList <- run.plgem(esdata = exampleSet, signLev=0.05, rank=5000, trimAllZeroRows=TRUE) Warning messages:
1: In plgem.fit(data = esdata, covariate = covariate, fitCondition = fitCondition,  :
  PLGEM slope is lower than 0.5
2: In plgem.fit(data = esdata, covariate = covariate, fitCondition = fitCondition,  :
  Adjusted r^2 is lower than 0.95
3: In plgem.fit(data = esdata, covariate = covariate, fitCondition = fitCondition,  :
  Pearson correlation coefficient is lower than 0.85

I was wondering why the adjusted R--square was so small. The Pearson correlation improved a little bit but still is less than 0.85. The
ln(mean) and ln(sd) are all negative, which means small difference between the two conditions and small variation. The residual histogram is positively skewed.
The data have substantial of missing values. I used na.roughfix followed by RFImpute in Random Forest package to impute the missing spectral counts. I think the imputation skewed the output.
Do you know any appropriate method to impute missing spectral counts from mass spectrometry?

Yolande

On Sun, Oct 24, 2010 at 12:49 PM, Pavelka, Norman <NXP at stowers.org> wrote:
> Hi Yolande,
>
> The warning messages are telling you that the model is not fitting in the expected way to the data. From the diagnostic plots it is clear that something's not right: you can see two populations of values, one of which clustered around close-to-zero values. In proteomics data, unlike microarray data, there's often a large majority of data-points in a dataset that are either missing or zero. This causes problems in the fitting of the model, as you experienced. For analyzing proteomics data, I strongly recommend using option 'trimAllZeroRows=TRUE' to remove all rows that contain only zero values in a given condition.
>
> Also, depending on the size of your proteomics dataset (in terms of how many proteins were identified by the mass-spec), you may experience some instability in the results using the default number of iterations. Try running the plgem wrapper a few times one after the other, and if you notice that the number of selected proteins is very variable from run to run, then try increasing the number of 'Iterations' in command 'run.plgem' to 1000, 2000 or even 5000. The runs will take a bit longer, but you should get more stable results.
>
> Let me know how it works!
>
> Norman
>
> -----Original Message-----
> From: Yolande Tra [mailto:yolande.tra at gmail.com]
> Sent: Saturday, October 23, 2010 1:54 PM
> To: Pavelka, Norman
> Subject: Re: [BioC] limma for spectral counts
>
> Hi Norman,
>
> I also run the wrapper mode and obtain the attached diagnostic plots.
> There was no protein differentially expressed in the output. It is totally different from the tutorial example data set diagnostics. What do you think?
>
> LPSdegList <- run.plgem(esdata = exampleSet) Warning messages:
> 1: In plgem.fit(data = esdata, covariate = covariate, fitCondition = fitCondition,  :
>  PLGEM slope is lower than 0.5
> 2: In plgem.fit(data = esdata, covariate = covariate, fitCondition = fitCondition,  :
>  Adjusted r^2 is lower than 0.95
> 3: In plgem.fit(data = esdata, covariate = covariate, fitCondition = fitCondition,  :
>  Pearson correlation coefficient is lower than 0.85
>
> Yolande
> On Fri, Oct 22, 2010 at 7:49 PM, Pavelka, Norman <NXP at stowers.org> wrote:
>> Hi Yolande,
>>
>> You can try normalizing your specral counts following the NSAF (Normalized Spectral Abundance Factor) approach and then you can use package 'plgem' to detect your differentially abundant proteins. You can have a look at this publication to get an idea and then let me know if you need any help:
>>
>> http://www.ncbi.nlm.nih.gov/pubmed/18029349
>>
>> Thanks and good luck!
>> Norman
>>
>>
>> On 20 October 2010 14:20, Yolande Tra <yolande.tra at gmail.com> wrote:
>>> Hello list members,
>>>
>>> I was wondering if limma method can be used for spectral counts of 
>>> proteins from mass spectrometry. If yes, is there a function in 
>>> Bioconductor that normalizes these counts.before running limma.
>>>
>>> Thank you for your help,
>>>
>>> Yolande
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>
>> Norman Pavelka, Ph.D.
>> Postdoctoral Research Associate
>> Rong Li lab
>> Stowers Institute for Medical Research 1000 E. 50th St.
>> Kansas City, MO 64110
>> U.S.A.
>>
>> phone: +1 (816) 926-4103
>> fax: +1 (816) 926-4658
>> e-mail: nxp at stowers.org
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>