[BioC] Siggenes SAM analysis: log2 transformation and Understanding output

Thu Feb 16 17:05:38 CET 2012

Hi David,

On 2/15/2012 8:30 AM, David Westergaard wrote:
> Hello,
>
> I am currently working on a data set about kiwi consumption for my
> bachelors project. The data is available at
> http://www.ebi.ac.uk/arrayexpress/experiments/E-MEXP-2030
>
> I'm abit confused as to how to interpret the output parameters,
> specifically p0. I've run the following code:
>
> dataset<- read.table("OAS_RMA.txt",header=TRUE)
> controls<- cbind(dataset$CEL12.1,dataset$CEL13.1,dataset$CEL23.1,dataset$CEL25.1,dataset$CEL37.1,dataset$CEL59.1,dataset$CEL61.1,dataset$CEL78.1,dataset$CEL9.1,dataset$CEL92.1)
> experiments<- cbind(dataset$CEL18.1,dataset$CEL21.1,dataset$CEL3.1,dataset$CEL31.1,dataset$CEL46.1,dataset$CEL50.1,dataset$CEL56.1,dataset$CEL57.1,dataset$CEL7.1)
>
> library('siggenes')
> datamatrix<- matrix(cbind(controls,experiments),ncol=19)
> y<- rep(0,19)
> y[11:19]<- 1
> gene_names<- as.character(dataset$Hybridization.REF)
> sam.obj = sam(datamatrix,y,gene.names=gene_names,rand=12345)
>
> Output:
> AM Analysis for the Two-Class Unpaired Case Assuming Unequal Variances
>
>   s0 = 0
>
>   Number of permutations: 100
>
>   MEAN number of falsely called variables is computed.
>
>     Delta    p0    False Called    FDR cutlow cutup   j2    j1
> 1    0.1 0.634 28335.89  37013 0.4851 -1.058 0.354 9709 27372
> 2    0.5 0.634 11200.82  21273 0.3336 -2.271 0.910 2447 35850
> 3    0.9 0.634   249.38   1522 0.1038 -3.374 3.088  541 53695
> 4    1.3 0.634     9.67    134 0.0457 -4.402 5.577  127 54669
> 5    1.7 0.634     0.69     20 0.0219 -5.596   Inf   20 54676
> 6    2.1 0.634        0      1      0 -9.072   Inf    1 54676
> 7    2.5 0.634        0      1      0 -9.072   Inf    1 54676
> 8    2.9 0.634        0      1      0 -9.072   Inf    1 54676
> 9    3.3 0.634        0      1      0 -9.072   Inf    1 54676
> 10   3.7 0.634        0      0      0   -Inf   Inf    0 54676
>
> I'm using the rand parameter because results seems to vary a bit. p0
> is in this case 0.634, and I'm not sure how to interpret this. From
> literature, this is described as "Prior probability that a gene is not
> differentially expressed" - What does this exactly mean? Does this
> imply, that there is a ~63% percent chance, that the genes in
> question, are actually NOT differentially expressed?

It means that about 63% of your genes appear to be not differentially 
expressed. So if you choose a gene at random, there is a 63% probability 
that you will choose one that isn't differentially expressed.

However, depending on the value of Delta that you choose, the 
expectation is that a far fewer percentage of the genes selected will be 
differentially expressed. In other words, you are trying to grab genes 
with a higher probability of differential expression, and you are then 
estimating what percentage of those genes are still likely false 
positives (e.g., if you choose a Delta of 1.3, you will get 134 
significant genes, and will expect that about 10 of those will be false 
positives).

> I've also found some varying sources saying that it is a good idea to
> log2 transform data before inputting into SAM. Does this still apply,
> and if so, why?

This is because the t-test is based on means, which are not very robust 
to outliers. Gene expression data tend to have a strong right skew, 
meaning that most of the data are within a certain range, but there are 
some values much higher. If you take logs, it tends to minimize the 
skew, so the large values have less of an effect (on the linear scale, 
expression values range from 0-65,000, on log2 scale, they range from 
0-16). It doesn't matter what base you use, but people have tended to 
use log base 2 because then a difference of 1 indicates a two-fold 
difference on the linear scale.

Best,

Jim

>
> Best Regards,
>
> David Westergaard
> Undergraduate student
> Technical University of Denmark
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor