[BioC] Siggenes SAM analysis: log2 transformation and Understanding output
James W. MacDonald
jmacdon at uw.edu
Fri Feb 17 15:25:53 CET 2012
Hi David,
On 2/17/2012 8:46 AM, David Westergaard wrote:
> Hi Jim,
>
> Thank you for your thorough answers.
>
> Just to make it absolutely clear: p0 is the chance that a randomly
> picked gene is NOT differentially expressed, when picking a gene from
> the WHOLE data set?
Yes. In Bayesian statistics there is the concept of a prior: an
estimate for a particular parameter that one usually comes up with by
using prior knowledge of the underlying process being studied. The
general idea is that your new observations shouldn't stray too far
from this prior estimate.
In SAM, as in limma, what we are doing is estimating the prior from all
genes under consideration (hence it is called an empirical Bayesian
method). Since we don't really have prior knowledge of the underlying
process, the idea is to compute some estimate using all genes, and use
that as our prior. So the prior in your case is 0.63, which is the
estimated proportion of genes in your study that are not differentially
expressed. So by definition, if you randomly choose a gene, there is an
estimated 63% probability that it isn't differentially expressed.
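To put numbers on that, here is a quick back-of-the-envelope sketch at
the R prompt, using the p0 and the probe set count from your output
below:

## Rough check using p0 = 0.634 and the 54676 probe sets
## reported in the SAM output table.
p0      <- 0.634
n.genes <- 54676
p0 * n.genes         # ~34665 genes expected NOT differentially expressed
(1 - p0) * n.genes   # ~20011 genes expected differentially expressed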
> Secondly, should I be wary of p0 values below or above some threshold?
> I am working on multiple data sets, and p0 seems to vary from 0.39 to
> 0.80.
> Lastly, is there an option to specify an FDR threshold? I am running
> through multiple data sets, and I would like to automate it, instead
> of having to look at a table for each one individually.
I wouldn't worry too much about the p0 values, unless they vary from
what you expect. For instance, if you are doing an experiment where you
expect lots of genes to be affected, and you get a p0 value of 0.8, then
you might wonder why siggenes disagrees with what you expect. But that
probably points to a problem with your expectation or the samples/data.
I don't see how you could automate things with siggenes. It is designed
to be run interactively.
<shameless plug>
I hate to point people away from siggenes, because I think Holger
Schwender has done a great job with the package, and continues to
support it many years after finishing his master's degree.
However, I have worked for years in microarray core facilities, and have
a whole suite of functions designed to automate the processing of
microarray analyses. The catch here is that these functions rely on the
limma package, which estimates FDR using p.adjust(), rather than via
permutation. So if you are interested in automation, particularly
automating the annotation/output side of things, take a look at
affycoretools (there is a rough sketch of the limma side below).
</shameless plug>
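To give a flavor of the limma side of things, a minimal sketch of an
automated run with an explicit FDR cutoff. This is not what
affycoretools does internally; I am assuming your 'datamatrix' from
below, with the 10 controls first and the 9 kiwi samples last, and an
arbitrary 5% FDR threshold:

## Minimal limma workflow with an explicit FDR cutoff.
library(limma)
grp    <- factor(rep(c("control", "kiwi"), c(10, 9)))
design <- model.matrix(~ grp)                 # intercept + kiwi effect
fit    <- eBayes(lmFit(datamatrix, design))   # fit, then moderate the t-statistics
tab    <- topTable(fit, coef = 2, number = Inf, adjust.method = "BH")
sig    <- tab[tab$adj.P.Val < 0.05, ]         # genes at an estimated 5% FDR
nrow(sig)                                     # how many made the cut

Since the cutoff is just a number, a loop over multiple data sets is
then trivial.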
Best,
Jim
>
> Best,
> David
>
>
>
> 2012/2/16 James W. MacDonald<jmacdon at uw.edu>:
>> Hi David,
>>
>>
>> On 2/15/2012 8:30 AM, David Westergaard wrote:
>>> Hello,
>>>
>>> I am currently working on a data set about kiwi consumption for my
>>> bachelor's project. The data is available at
>>> http://www.ebi.ac.uk/arrayexpress/experiments/E-MEXP-2030
>>>
>>> I'm a bit confused as to how to interpret the output parameters,
>>> specifically p0. I've run the following code:
>>>
>>> dataset <- read.table("OAS_RMA.txt", header = TRUE)
>>> controls <- cbind(dataset$CEL12.1, dataset$CEL13.1, dataset$CEL23.1,
>>>                   dataset$CEL25.1, dataset$CEL37.1, dataset$CEL59.1,
>>>                   dataset$CEL61.1, dataset$CEL78.1, dataset$CEL9.1,
>>>                   dataset$CEL92.1)
>>> experiments <- cbind(dataset$CEL18.1, dataset$CEL21.1, dataset$CEL3.1,
>>>                      dataset$CEL31.1, dataset$CEL46.1, dataset$CEL50.1,
>>>                      dataset$CEL56.1, dataset$CEL57.1, dataset$CEL7.1)
>>>
>>> library('siggenes')
>>> datamatrix <- matrix(cbind(controls, experiments), ncol = 19)
>>> y <- rep(0, 19)
>>> y[11:19] <- 1
>>> gene_names <- as.character(dataset$Hybridization.REF)
>>> sam.obj <- sam(datamatrix, y, gene.names = gene_names, rand = 12345)
>>>
>>> Output:
>>> SAM Analysis for the Two-Class Unpaired Case Assuming Unequal Variances
>>>
>>> s0 = 0
>>>
>>> Number of permutations: 100
>>>
>>> MEAN number of falsely called variables is computed.
>>>
>>> Delta p0 False Called FDR cutlow cutup j2 j1
>>> 1 0.1 0.634 28335.89 37013 0.4851 -1.058 0.354 9709 27372
>>> 2 0.5 0.634 11200.82 21273 0.3336 -2.271 0.910 2447 35850
>>> 3 0.9 0.634 249.38 1522 0.1038 -3.374 3.088 541 53695
>>> 4 1.3 0.634 9.67 134 0.0457 -4.402 5.577 127 54669
>>> 5 1.7 0.634 0.69 20 0.0219 -5.596 Inf 20 54676
>>> 6 2.1 0.634 0 1 0 -9.072 Inf 1 54676
>>> 7 2.5 0.634 0 1 0 -9.072 Inf 1 54676
>>> 8 2.9 0.634 0 1 0 -9.072 Inf 1 54676
>>> 9 3.3 0.634 0 1 0 -9.072 Inf 1 54676
>>> 10 3.7 0.634 0 0 0 -Inf Inf 0 54676
>>>
>>> I'm using the rand parameter because results seem to vary a bit. p0
>>> is in this case 0.634, and I'm not sure how to interpret this. In the
>>> literature, this is described as the "prior probability that a gene is
>>> not differentially expressed" - what exactly does this mean? Does this
>>> imply that there is a ~63% chance that the genes in question are
>>> actually NOT differentially expressed?
>>
>> It means that about 63% of your genes appear to be not differentially
>> expressed. So if you choose a gene at random, there is a 63% probability
>> that you will choose one that isn't differentially expressed.
>>
>> However, depending on the value of Delta that you choose, the
>> expectation is that a far smaller percentage of the selected genes
>> will be false positives. In other words, you are trying to grab genes
>> with a higher probability of differential expression, and you are then
>> estimating what percentage of those genes are still likely false
>> positives (e.g., if you choose a Delta of 1.3, you will get 134
>> significant genes, and will expect that about 10 of those will be
>> false positives).
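>>
>> For example, to pull out those 134 genes at Delta = 1.3, something
>> like this should work (if I remember the siggenes interface
>> correctly):
>>
>> ## Summarize the SAM fit at Delta = 1.3 and list the significant genes.
>> summary(sam.obj, 1.3)
>> sig.genes <- list.siggenes(sam.obj, 1.3)
>> length(sig.genes)   # should be ~134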
>>
>>
>>> I've also found various sources saying that it is a good idea to
>>> log2 transform data before inputting it into SAM. Does this still
>>> apply, and if so, why?
>>
>> This is because the t-test is based on means, which are not very
>> robust to outliers. Gene expression data tend to have a strong right
>> skew, meaning that most of the data are within a certain range, but
>> some values are much higher. Taking logs tends to minimize the skew,
>> so the large values have less of an effect (on the linear scale,
>> expression values range from 0 to 65,000; on the log2 scale, from 0
>> to 16). It doesn't matter what base you use, but people have tended
>> to use log base 2 because then a difference of 1 indicates a two-fold
>> difference on the linear scale.
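>>
>> A quick illustration at the R prompt (made-up intensities):
>>
>> ## Each doubling on the linear scale adds exactly 1 on the log2 scale.
>> log2(c(500, 1000, 2000))   # 8.97  9.97 10.97
>> log2(2^16)                 # 16, roughly the top of the scanner's range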
>>
>> Best,
>>
>> Jim
>>
>>
>>> Best Regards,
>>>
>>> David Westergaard
>>> Undergraduate student
>>> Technical University of Denmark
>>>