[BioC] Siggenes SAM analysis: log2 transformation and Understanding output
James W. MacDonald
jmacdon at uw.edu
Thu Feb 16 17:05:38 CET 2012
Hi David,
On 2/15/2012 8:30 AM, David Westergaard wrote:
> Hello,
>
> I am currently working on a data set about kiwi consumption for my
> bachelors project. The data is available at
> http://www.ebi.ac.uk/arrayexpress/experiments/E-MEXP-2030
>
> I'm abit confused as to how to interpret the output parameters,
> specifically p0. I've run the following code:
>
> dataset<- read.table("OAS_RMA.txt",header=TRUE)
> controls<- cbind(dataset$CEL12.1,dataset$CEL13.1,dataset$CEL23.1,dataset$CEL25.1,dataset$CEL37.1,dataset$CEL59.1,dataset$CEL61.1,dataset$CEL78.1,dataset$CEL9.1,dataset$CEL92.1)
> experiments<- cbind(dataset$CEL18.1,dataset$CEL21.1,dataset$CEL3.1,dataset$CEL31.1,dataset$CEL46.1,dataset$CEL50.1,dataset$CEL56.1,dataset$CEL57.1,dataset$CEL7.1)
>
> library('siggenes')
> datamatrix<- matrix(cbind(controls,experiments),ncol=19)
> y<- rep(0,19)
> y[11:19]<- 1
> gene_names<- as.character(dataset$Hybridization.REF)
> sam.obj = sam(datamatrix,y,gene.names=gene_names,rand=12345)
>
> Output:
> AM Analysis for the Two-Class Unpaired Case Assuming Unequal Variances
>
> s0 = 0
>
> Number of permutations: 100
>
> MEAN number of falsely called variables is computed.
>
> Delta p0 False Called FDR cutlow cutup j2 j1
> 1 0.1 0.634 28335.89 37013 0.4851 -1.058 0.354 9709 27372
> 2 0.5 0.634 11200.82 21273 0.3336 -2.271 0.910 2447 35850
> 3 0.9 0.634 249.38 1522 0.1038 -3.374 3.088 541 53695
> 4 1.3 0.634 9.67 134 0.0457 -4.402 5.577 127 54669
> 5 1.7 0.634 0.69 20 0.0219 -5.596 Inf 20 54676
> 6 2.1 0.634 0 1 0 -9.072 Inf 1 54676
> 7 2.5 0.634 0 1 0 -9.072 Inf 1 54676
> 8 2.9 0.634 0 1 0 -9.072 Inf 1 54676
> 9 3.3 0.634 0 1 0 -9.072 Inf 1 54676
> 10 3.7 0.634 0 0 0 -Inf Inf 0 54676
>
> I'm using the rand parameter because results seems to vary a bit. p0
> is in this case 0.634, and I'm not sure how to interpret this. From
> literature, this is described as "Prior probability that a gene is not
> differentially expressed" - What does this exactly mean? Does this
> imply, that there is a ~63% percent chance, that the genes in
> question, are actually NOT differentially expressed?
It means that about 63% of your genes appear to be not differentially
expressed. So if you choose a gene at random, there is a 63% probability
that you will choose one that isn't differentially expressed.
However, depending on the value of Delta that you choose, the
expectation is that a far fewer percentage of the genes selected will be
differentially expressed. In other words, you are trying to grab genes
with a higher probability of differential expression, and you are then
estimating what percentage of those genes are still likely false
positives (e.g., if you choose a Delta of 1.3, you will get 134
significant genes, and will expect that about 10 of those will be false
positives).
> I've also found some varying sources saying that it is a good idea to
> log2 transform data before inputting into SAM. Does this still apply,
> and if so, why?
This is because the t-test is based on means, which are not very robust
to outliers. Gene expression data tend to have a strong right skew,
meaning that most of the data are within a certain range, but there are
some values much higher. If you take logs, it tends to minimize the
skew, so the large values have less of an effect (on the linear scale,
expression values range from 0-65,000, on log2 scale, they range from
0-16). It doesn't matter what base you use, but people have tended to
use log base 2 because then a difference of 1 indicates a two-fold
difference on the linear scale.
Best,
Jim
>
> Best Regards,
>
> David Westergaard
> Undergraduate student
> Technical University of Denmark
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
More information about the Bioconductor
mailing list