[BioC] Siggenes SAM analysis: log2 transformation and Understanding output
James W. MacDonald
jmacdon at uw.edu
Fri Feb 17 15:25:53 CET 2012
Hi David,
On 2/17/2012 8:46 AM, David Westergaard wrote:
> Hi Jim,
>
> Thank you for your thorough answers.
>
> Just to make it absolutely clear: p0 is the chance that a randomly
> picked gene is NOT differentially expressed, when picking a gene from
> the WHOLE data set?
Yes. In Bayesian statistics there is the concept of a prior: an
estimate for a particular parameter that one usually comes up with by
using prior knowledge of the underlying process being studied. The
general idea is that your new observations shouldn't stray too far
from this prior estimate.
In SAM, as in limma, what we are doing is estimating the prior from all
genes under consideration (hence it is called an empirical Bayesian
method). Since we don't really have prior knowledge of the underlying
process, the idea is to compute some estimate using all genes, and use
that as our prior. So the prior in your case is 0.63, which is the
estimated proportion of genes in your study that are not differentially
expressed. So by definition, if you randomly choose a gene, there is an
estimated 63% probability that it isn't differentially expressed.
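To put numbers on that, here is a quick back-of-the-envelope sketch at
the R prompt, using the p0 and the probe set count from your output
below:

## Rough check using p0 = 0.634 and the 54676 probe sets
## reported in the SAM output table.
p0      <- 0.634
n.genes <- 54676
p0 * n.genes         # ~34665 genes expected NOT differentially expressed
(1 - p0) * n.genes   # ~20011 genes expected differentially expressed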
> Secondly, should I be wary of p0 values below or above some threshold?
> I am working on multiple data sets, and p0 seems to vary from 0.39 to
> 0.80.
> Lastly, is there an option to specify an FDR threshold? I am running
> through multiple data sets, and I would like to automate it, instead
> of having to look at a table for each one individually.
I wouldn't worry too much about the p0 values, unless they vary from
what you expect. For instance, if you are doing an experiment where you
expect lots of genes to be affected, and you get a p0 value of 0.8, then
you might wonder why siggenes disagrees with what you expect. But that
probably points to a problem with your expectation or the samples/data.
I don't see how you could automate things with siggenes. It is designed
to be run interactively.
<shameless plug>
I hate to point people away from siggenes, because I think Holger
Schwender has done a great job with the package, and continues to
support it many years after finishing his master's degree.
However, I have worked for years in microarray core facilities, and have
a whole suite of functions designed to automate the processing of
microarray analyses. The catch here is that these functions rely on the
limma package, which estimates FDR using p.adjust(), rather than via
permutation. So if you are interested in automation, particularly
automating the annotation/output side of things, take a look at
affycoretools (there is a rough sketch of the limma side below).
</shameless plug>
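To give a flavor of the limma side of things, a minimal sketch of an
automated run with an explicit FDR cutoff. This is not what
affycoretools does internally; I am assuming your 'datamatrix' from
below, with the 10 controls first and the 9 kiwi samples last, and an
arbitrary 5% FDR threshold:

## Minimal limma workflow with an explicit FDR cutoff.
library(limma)
grp    <- factor(rep(c("control", "kiwi"), c(10, 9)))
design <- model.matrix(~ grp)                 # intercept + kiwi effect
fit    <- eBayes(lmFit(datamatrix, design))   # fit, then moderate the t-statistics
tab    <- topTable(fit, coef = 2, number = Inf, adjust.method = "BH")
sig    <- tab[tab$adj.P.Val < 0.05, ]         # genes at an estimated 5% FDR
nrow(sig)                                     # how many made the cut

Since the cutoff is just a number, a loop over multiple data sets is
then trivial.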
Best,
Jim
>
> Best,
> David
>
>
>
> 2012/2/16 James W. MacDonald<jmacdon at uw.edu>:
>> Hi David,
>>
>>
>> On 2/15/2012 8:30 AM, David Westergaard wrote:
>>> Hello,
>>>
>>> I am currently working on a data set about kiwi consumption for my
>>> bachelor's project. The data is available at
>>> http://www.ebi.ac.uk/arrayexpress/experiments/E-MEXP-2030
>>>
>>> I'm a bit confused as to how to interpret the output parameters,
>>> specifically p0. I've run the following code:
>>>
>>> dataset <- read.table("OAS_RMA.txt", header = TRUE)
>>> controls <- cbind(dataset$CEL12.1, dataset$CEL13.1, dataset$CEL23.1,
>>>                   dataset$CEL25.1, dataset$CEL37.1, dataset$CEL59.1,
>>>                   dataset$CEL61.1, dataset$CEL78.1, dataset$CEL9.1,
>>>                   dataset$CEL92.1)
>>> experiments <- cbind(dataset$CEL18.1, dataset$CEL21.1, dataset$CEL3.1,
>>>                      dataset$CEL31.1, dataset$CEL46.1, dataset$CEL50.1,
>>>                      dataset$CEL56.1, dataset$CEL57.1, dataset$CEL7.1)
>>>
>>> library('siggenes')
>>> datamatrix <- matrix(cbind(controls, experiments), ncol = 19)
>>> y <- rep(0, 19)
>>> y[11:19] <- 1
>>> gene_names <- as.character(dataset$Hybridization.REF)
>>> sam.obj <- sam(datamatrix, y, gene.names = gene_names, rand = 12345)
>>>
>>> Output:
>>> SAM Analysis for the Two-Class Unpaired Case Assuming Unequal Variances
>>>
>>> s0 = 0
>>>
>>> Number of permutations: 100
>>>
>>> MEAN number of falsely called variables is computed.
>>>
>>> Delta p0 False Called FDR cutlow cutup j2 j1
>>> 1 0.1 0.634 28335.89 37013 0.4851 -1.058 0.354 9709 27372
>>> 2 0.5 0.634 11200.82 21273 0.3336 -2.271 0.910 2447 35850
>>> 3 0.9 0.634 249.38 1522 0.1038 -3.374 3.088 541 53695
>>> 4 1.3 0.634 9.67 134 0.0457 -4.402 5.577 127 54669
>>> 5 1.7 0.634 0.69 20 0.0219 -5.596 Inf 20 54676
>>> 6 2.1 0.634 0 1 0 -9.072 Inf 1 54676
>>> 7 2.5 0.634 0 1 0 -9.072 Inf 1 54676
>>> 8 2.9 0.634 0 1 0 -9.072 Inf 1 54676
>>> 9 3.3 0.634 0 1 0 -9.072 Inf 1 54676
>>> 10 3.7 0.634 0 0 0 -Inf Inf 0 54676
>>>
>>> I'm using the rand parameter because results seem to vary a bit. p0
>>> is in this case 0.634, and I'm not sure how to interpret this. In the
>>> literature, this is described as the "prior probability that a gene is
>>> not differentially expressed" - what exactly does this mean? Does this
>>> imply that there is a ~63% chance that the genes in question are
>>> actually NOT differentially expressed?
>>
>> It means that about 63% of your genes appear to be not differentially
>> expressed. So if you choose a gene at random, there is a 63% probability
>> that you will choose one that isn't differentially expressed.
>>
>> However, depending on the value of Delta that you choose, the
>> expectation is that a far smaller percentage of the selected genes
>> will be false positives. In other words, you are trying to grab genes
>> with a higher probability of differential expression, and you are then
>> estimating what percentage of those genes are still likely false
>> positives (e.g., if you choose a Delta of 1.3, you will get 134
>> significant genes, and will expect that about 10 of those will be
>> false positives).
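>>
>> For example, to pull out those 134 genes at Delta = 1.3, something
>> like this should work (if I remember the siggenes interface
>> correctly):
>>
>> ## Summarize the SAM fit at Delta = 1.3 and list the significant genes.
>> summary(sam.obj, 1.3)
>> sig.genes <- list.siggenes(sam.obj, 1.3)
>> length(sig.genes)   # should be ~134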
>>
>>
>>> I've also found various sources saying that it is a good idea to
>>> log2 transform data before inputting it into SAM. Does this still
>>> apply, and if so, why?
>>
>> This is because the t-test is based on means, which are not very
>> robust to outliers. Gene expression data tend to have a strong right
>> skew, meaning that most of the data are within a certain range, but
>> some values are much higher. Taking logs tends to minimize the skew,
>> so the large values have less of an effect (on the linear scale,
>> expression values range from 0 to 65,000; on the log2 scale, from 0
>> to 16). It doesn't matter what base you use, but people have tended
>> to use log base 2 because then a difference of 1 indicates a two-fold
>> difference on the linear scale.
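>>
>> A quick illustration at the R prompt (made-up intensities):
>>
>> ## Each doubling on the linear scale adds exactly 1 on the log2 scale.
>> log2(c(500, 1000, 2000))   # 8.97  9.97 10.97
>> log2(2^16)                 # 16, roughly the top of the scanner's range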
>>
>> Best,
>>
>> Jim
>>
>>
>>> Best Regards,
>>>
>>> David Westergaard
>>> Undergraduate student
>>> Technical University of Denmark
>>>