[BioC] normalisation + sam
Wolfgang Huber
whuber at embl.de
Fri Apr 23 18:58:26 CEST 2010
Hi Mike
I am afraid it does look like a normalisation problem. siggenes is
right, given the data. Have a look at the heatmap in the PDF file at
http://www-huber.embl.de/users/whuber/bioc-list/100423/
and the R script in the same directory, which derives from your code below.
Best wishes
Wolfgang
Mike Dewar scripsit 23/04/10 17:11:
> Hi Wolfgang,
>
> To get from the CEL files to a Bioconductor ExpressionSet is quite
> longwinded and, as it's not using Bioconductor, maybe not something I
> should abuse this list with. Instead, I've put the expression set I've
> generated up online at
>
> http://www.columbia.edu/~md2954/immgen.data
>
> Then, to generate the results I'm seeing, I use the code below. I'm
> pretty new to R, too, so any general hints also welcome. In the code
> below I'm interested in seeing which genes that have symbols like "CD4"
> are differentially expressed. Note that if I just use all the genes (and
> cut out the code that selects "^Cd\\d+" genes) then I still get results
> implying that my genes are not distributed in the way sam is expecting.
>
> To precisize my question then, does the result of SAM, as expressed in
> the below code, imply that my array data is badly normalised? If not,
> does it imply some horrific misunderstanding on my part on how these
> things should work?
>
> Mike
>
> library(siggenes)
> library(Biobase)
> # PUT THE FOLDER WHERE YOU STORED immgen.data, i.e.
> #path = "/Users/mike/Data/Immgen/userData/GSE15907"
> path = "."
> # USE THE SEPARATOR APPROPRIATE FOR YOUR SYSTEM (must learn more about this)
> sep = "/"
> # load the data
> filename = "immgen.data"
> load(paste(path,filename,sep=sep)) # loads a ExpressionSet called immgen
> # form a vector which corresponds to the class of each array
> p = pData(immgen)
> cl = colnames(p)[apply(p, 1, which)]
> # remove "phenotype" genes
> # any gene that directly encodes a surface marker is highlighted
> regexp = "^Cd\\d+"
> marker_indices = grep(regexp,fData(immgen)$symbol)
> markers = immgen[marker_indices,]
> # class labels:
> classes <- colnames(pData(markers))
> # find out how many of each class we have
> count <- lapply(pData(markers),sum)
> # chuck out all the experiments with less than 4 examples. This will
> leave just two classes.
> to_keep = as.logical(sapply(
> as.data.frame(t(
> # the next line chooses those phenotypes with more than n examples
> # set n to 2 to leave 28 classes
> n = 3
> pData(markers)[,classes[count>n]]
> )),
> sum # the sum is for each row, hence the transposition above
> ))
> # now to_keep has a 1 next to each array we want and a 0 to those we don't
> # hence we can use to_keep to index the markers expression set
> markers <- markers[,to_keep]
> # and then chuck out classes we've rejected
> cl <- cl[to_keep]
> # then apply sam to the markers
> sam.out = sam(data=markers,cl=cl,var.equal=FALSE)
> print(sam.out, seq(0.1, 5, 0.1))
> plot(sam.out, 0.5)
>
>
>
> On 22 Apr 2010, at 15:41, Wolfgang Huber wrote:
>
>> Hi Mike
>>
>> can you provide a reproducible example (R script) and the output of
>> sessionInfo() for the 2-groups comparison? This would include a
>> pointer to the 6(?) CEL files on the web.
>>
>> For the 28-classes case, I doubt that hypothesis testing of the null
>> hypothesis "mean in all classes is the same" is the most useful data
>> analytic approach. Perhaps you can precisize your question.
>>
>> Best wishes
>> Wolfgang
>>
>>> Hi,
>>> I'm trying to use siggenes - sam() to look at differentially
>>> expressed genes of data taken from the Immunological Genome project
>>> (http://www.immgen.org/). A problem with this is that I have to
>>> perform the preprocessing of the original CEL files as they do not
>>> make the processed data available on GEO. To do this I'm using the
>>> aroma suite of packages to perform quantile normalisation of this
>>> data set (so far 128 arrays) in fixed memory (i.e. my laptop). This
>>> is a good thing, as it has forced me to learn a little about array
>>> preprocessing, and a bad thing as I'm new to all this and might be
>>> going horribly wrong. When it comes time to look for differentially
>>> expressed genes, I'm using siggenes - sam() and I'm getting some
>>> strange results. I'm using (what I think would be considered) many
>>> classes (28), where each class has at least 3 examples, and thus
>>> throwing out some of the arrays. The results I'm getting look like:
>>> SAM Analysis for the Multi-Class Case with 28 Classes
>>> Delta p0 False Called FDR
>>> 1 0.1 0.014 23355.05 23532 0.013997
>>> 2 0.2 0.014 21805.98 23498 0.013087
>>> 3 0.3 0.014 15923.83 23169 0.009693
>>> 4 0.4 0.014 9637.47 22343 0.006083
>>> 5 0.5 0.014 5527.75 21330 0.003655
>>> 6 0.6 0.014 3060.73 20221 0.002135
>>> 7 0.7 0.014 1703.82 19205 0.001251
>>> 8 0.8 0.014 953.02 18256 0.000736
>>> 9 0.9 0.014 536.81 17382 0.000436
>>> 10 1.0 0.014 307.19 16730 0.000259
>>> 11 1.1 0.014 176 16143 0.000154
>>> which I think implies that many many genes are differentially
>>> expressed. Using plot(sam.out, 1.2) shows a pretty much vertical line
>>> starting at the origin, with no genes observably behaving like the
>>> null model. Even if I only try this on 2 classes, and hence throwing
>>> out most of the data, I'm still not getting sensible results.
>>> Now I'm hoping that I'm doing something wrong, and that not 16K of my
>>> genes are differentially expressed. However, I'm having difficulty
>>> figuring out what it might be. The one striking thing between my data
>>> set and the golub example set is that golub seems to be zero-mean -
>>> is this a requirement for sam()?
>>> Any other ideas of what to look for, or what other information I
>>> could provide to help this question make sense, would be greatly
>>> appreciated.
>>> Thanks in advance,
>>> Mike Dewar
>>> - - -
>>> Dr Michael Dewar
>>> Postdoctoral Research Scientist Applied Mathematics
>>> Columbia University
>>> http://www.columbia.edu/~md2954/
>>> [[alternative HTML version deleted]]
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch <mailto:Bioconductor at stat.math.ethz.ch>
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>> --
>>
>>
>> Wolfgang Huber
>> EMBL
>> http://www.embl.de/research/units/genome_biology/huber
>>
>>
>>
>
> - - -
> Dr Michael Dewar
> **Postdoctoral Research Scientist
> Applied Mathematics
> Columbia University
> http://www.columbia.edu/~md2954/
>
>
>
>
>
>
--
Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber
More information about the Bioconductor
mailing list