Dear Sabrina,
Thank you for the diagnostic plots, they help a lot for evaluating the
fitting of the model.
In fact, though the residuals seem to be quite normally distributed
(which is good!), there is a quite strong non-linear component that
sits behind the ln(sd)-versus-ln(mean) dependence in your data (which
is bad!). Just to check if the model was fitted in the intended way,
could you please send me the output of:
pData(eset) # where 'eset' is the name of your 'exprSet'?
First possible reason for the poor fit: It looks as if your data were
normalized with RMA or maybe GCRMA, right? Did you remember to anti-log
the expression values before subjecting the 'exprSet' to 'run.plgem'?
PLGEM was designed to be fitted on linear values, not on logged
values... All you have to do is
eset.lin <- 2^eset
run.plgem('eset.lin')
Second possible reason: saturation! Use 'hist' or 'boxplot' to check if
there is a saturation peak arising on the right side of the histogram
or in the upper part of the boxplot (use the logged data in this
case...). If there is saturation, I'm afraid that 'plgem' won't help
you make your data better. But my guess is that you simply forgot to
anti-log the exprSet... ;-)
Coming to your last question: What happens if you repeat the analysis
using different values of 'signLev'? Does the percentage of maximum
expected FDR stay constant? Maybe if you use a smaller value of
'signLev', you could obtain a smaller number of more significant
findings...
Hope that helps!
Norman
Norman Pavelka
Department of Biotechnology and Bioscience
University of Milano-Bicocca
Piazza della Scienza, 2
20126 Milan, Italy
Phone: +39 02 6448 3556
Fax: +39 02 6448 3552
On 25 Jul 2005, at 09:50, Sabrina Carpentier wrote:
> Dear Norman,
>
> Thanks a lot for all these details.
> To answer to your first question I have biological replicates.
> I launched the function with the parameters which you advised me and I
> obtain the attached fittingEval. I would like your expert opinion :-)
> In fact I would like to conclude :"maybe it is correct". What do you
> think?
>
> Otherwise, I use the parameter signLev to determine the FDR and
> unfortunately, I have very poor results: When I choose the
> signLev=0.001 with 22626 genes in my chip, I found 48 DEG so my FDR <=
> 50% !!
> In my opinion, I have so few replicates, that's why the multiple
> tests correction is too strong to find DEG? What are yours opinions?
>
> Sincerely
>
> Sabrina
>> ----- Original Message -----
>> From: Norman Pavelka
>> To: Sabrina Carpentier
>> Cc: bioconductor@stat.math.ethz.ch
>> Sent: Thursday, July 21, 2005 11:11 AM
>> Subject: Re: plgem
>>
>> Dear Sabrina,
>>
>> I'm very glad that you contacted me about our Bioconductor package
>> 'plgem'. I will be happy to help you in using it in the best possible
>> way. Since your questions are quite general, I'm CCing this answer to
>> the BioC mailing list, in case somebody else had the same doubts or
>> would like to comment on some the features of this package.
>>
>> 1) In our hands, the minimum number of arrays for fitting the model
>> (PLGEM) is 3. But of course everything depends on the data
>> themselves: are these technical or biological replicates? In any
>> case, every time you would like to try 'plgem' on a new data set, be
>> sure to always set the parameter 'fitting.eval' in the 'run.plgem'
>> wrapper function to 'TRUE' (this is the default setting, actually).
>> This will produce a series of diagnostic plots, that you can
>> optionally choose to save to a png file, setting the parameter
>> 'plotFile=TRUE'. To judge if the model fits well to your data, you
>> should see the following:
>>
>> - The majority of your data points should indeed be scattered along
>> the fitted straight line in the ln(sd) vs. ln(mean) plot.
>> - The residuals vs. rank plot should produce a sort of regular
>> 'band', centered around a y-axis value of zero and with a constant
>> width along the x-axis.
>> - The histogram of the residuals should be approximately normal,
>> which you can easily judge by checking if the 'Normal Q-Q plot'
>> produces an approximately diagonal line.
>>
>> Just run the example provided in the vignette to have an idea of a
>> typical plot:
>>
>> library(plgem) # loads the package and all its dependencies
>> data(LPSeset) # loads the example data set
>> LPS.DEGlist <- run.plgem(LPSeset) # runs the wrapper using defaults
>> settings
>>
>> You will notice that the above requirements are not perfectly met in
>> this particular data set. With more replicates the model usually fits
>> better. Anyway, in a situation like the one above, the performance of
>> the method is still very good!
>>
>> 2) If your question is "Do you have a way to explicitly CONTROL the
>> FDR?", the answer in NO.
>> Actually there are very few methods that automatically do this. Also
>> SAM (Significance Analysis of Microarrays, Tusher VG et al., PNAS
>> 2001), which is said to do this, actually only provides you an
>> estimate of the FDR on your output result. It's up to you and your
>> choice of the input 'Delta' parameter to fine-tune the output FDR.
>> The problem there is that the relationship between the input 'Delta'
>> and the output 'FDR' is not predictable a priori and very much data
>> set-dependent.
>>
>> What we provide in our method is an input parameter (signLev) that we
>> have shown to be a good direct predictor of the number of false
>> positives that you expect to observe when actually not a single gene
>> should be differentially expressed. So, all you have to do is choose
>> the maximum number of false positives you accept to observe in your
>> output, e.g. 10, divide that number by the total number of genes in
>> your chip, e.g. 10/10000=0.001, use that number directly as the input
>> parameter 'signLev=0.001', and finally see how many genes you get
>> with those settings, e.g. 500. Then simply divide your chosen number
>> of false positives by the number of selected genes, e.g. 10/500=0.02,
>> and your actual FDR should be less than or equal to that number.
>>
>> Of course, also with our method, you will have to do different trials
>> if your goal is to get a particular FDR, because the number of
>> selected genes cannot be predicted a priori.
>>
>> Hope that helps!
>> Norman
>>
>> Norman Pavelka
>> Department of Biotechnology and Bioscience
>> University of Milano-Bicocca
>> Piazza della Scienza, 2
>> 20126 Milan, Italy
>>
>> Phone: +39 02 6448 3556
>> Fax: +39 02 6448 3552
>>
>> On 20 Jul 2005, at 16:14, Sabrina Carpentier wrote:
>>
>>>
>>>
>>> I tried your package plgem and I asked me if it is possible to apply
>>> it on my data set: I have only 4 arrays per groups, what do you
>>> think?
>>> Moreover, do you include in your algorithm a multiple test
>>> correction? I think that you include FDR , don't you?
>>>
>>> Thanks a lot for your precisions
>>>
>>> Sincerely,
>>>
>>> Sabrina
[[alternative text/enriched version deleted]]