[BioC] normalisation + sam
Wolfgang Huber
whuber at embl.de
Fri Apr 23 20:47:08 CEST 2010
Hi Mike
I added another heatmap, not using your 97 "marker" genes, but 200
random ones instead. The clustering result for the samples is the same:
http://www-huber.embl.de/users/whuber/bioc-list/100423
Also, your fData(immgen) is ill-formatted. It should be a data.frame
with as many rows as features, instead it has as many columns, with one row:
> dim(immgen)
Features Samples
23532 128
> dim(fData(immgen))
[1] 23532 1
Best wishes
Wolfgang
Wolfgang Huber ha scritto:
>
>
> Hi Mike
>
> I am afraid it does look like a normalisation problem. siggenes is
> right, given the data. Have a look at the heatmap in the PDF file at
> http://www-huber.embl.de/users/whuber/bioc-list/100423/
> and the R script in the same directory, which derives from your code below.
>
> Best wishes
> Wolfgang
>
>
> Mike Dewar scripsit 23/04/10 17:11:
>> Hi Wolfgang,
>>
>> To get from the CEL files to a Bioconductor ExpressionSet is quite
>> longwinded and, as it's not using Bioconductor, maybe not something I
>> should abuse this list with. Instead, I've put the expression set I've
>> generated up online at
>> http://www.columbia.edu/~md2954/immgen.data
>>
>> Then, to generate the results I'm seeing, I use the code below. I'm
>> pretty new to R, too, so any general hints also welcome. In the code
>> below I'm interested in seeing which genes that have symbols like
>> "CD4" are differentially expressed. Note that if I just use all the
>> genes (and cut out the code that selects "^Cd\\d+" genes) then I still
>> get results implying that my genes are not distributed in the way sam
>> is expecting.
>> To precisize my question then, does the result of SAM, as expressed in
>> the below code, imply that my array data is badly normalised? If not,
>> does it imply some horrific misunderstanding on my part on how these
>> things should work?
>>
>> Mike
>>
>> library(siggenes)
>> library(Biobase)
>> # PUT THE FOLDER WHERE YOU STORED immgen.data, i.e.
>> #path = "/Users/mike/Data/Immgen/userData/GSE15907"
>> path = "."
>> # USE THE SEPARATOR APPROPRIATE FOR YOUR SYSTEM (must learn more about
>> this)
>> sep = "/"
>> # load the data
>> filename = "immgen.data"
>> load(paste(path,filename,sep=sep)) # loads a ExpressionSet called immgen
>> # form a vector which corresponds to the class of each array
>> p = pData(immgen)
>> cl = colnames(p)[apply(p, 1, which)]
>> # remove "phenotype" genes
>> # any gene that directly encodes a surface marker is highlighted
>> regexp = "^Cd\\d+"
>> marker_indices = grep(regexp,fData(immgen)$symbol)
>> markers = immgen[marker_indices,]
>> # class labels:
>> classes <- colnames(pData(markers))
>> # find out how many of each class we have
>> count <- lapply(pData(markers),sum)
>> # chuck out all the experiments with less than 4 examples. This will
>> leave just two classes. to_keep = as.logical(sapply(
>> as.data.frame(t(
>> # the next line chooses those phenotypes with more than n examples
>> # set n to 2 to leave 28 classes n = 3
>> pData(markers)[,classes[count>n]] )),
>> sum # the sum is for each row, hence the transposition above
>> ))
>> # now to_keep has a 1 next to each array we want and a 0 to those we
>> don't
>> # hence we can use to_keep to index the markers expression set
>> markers <- markers[,to_keep]
>> # and then chuck out classes we've rejected
>> cl <- cl[to_keep]
>> # then apply sam to the markers
>> sam.out = sam(data=markers,cl=cl,var.equal=FALSE)
>> print(sam.out, seq(0.1, 5, 0.1))
>> plot(sam.out, 0.5)
>>
>>
>>
>> On 22 Apr 2010, at 15:41, Wolfgang Huber wrote:
>>
>>> Hi Mike
>>>
>>> can you provide a reproducible example (R script) and the output of
>>> sessionInfo() for the 2-groups comparison? This would include a
>>> pointer to the 6(?) CEL files on the web.
>>>
>>> For the 28-classes case, I doubt that hypothesis testing of the null
>>> hypothesis "mean in all classes is the same" is the most useful data
>>> analytic approach. Perhaps you can precisize your question.
>>>
>>> Best wishes
>>> Wolfgang
>>>
>>>> Hi,
>>>> I'm trying to use siggenes - sam() to look at differentially
>>>> expressed genes of data taken from the Immunological Genome project
>>>> (http://www.immgen.org/). A problem with this is that I have to
>>>> perform the preprocessing of the original CEL files as they do not
>>>> make the processed data available on GEO. To do this I'm using the
>>>> aroma suite of packages to perform quantile normalisation of this
>>>> data set (so far 128 arrays) in fixed memory (i.e. my laptop). This
>>>> is a good thing, as it has forced me to learn a little about array
>>>> preprocessing, and a bad thing as I'm new to all this and might be
>>>> going horribly wrong. When it comes time to look for differentially
>>>> expressed genes, I'm using siggenes - sam() and I'm getting some
>>>> strange results. I'm using (what I think would be considered) many
>>>> classes (28), where each class has at least 3 examples, and thus
>>>> throwing out some of the arrays. The results I'm getting look like:
>>>> SAM Analysis for the Multi-Class Case with 28 Classes
>>>> Delta p0 False Called FDR
>>>> 1 0.1 0.014 23355.05 23532 0.013997
>>>> 2 0.2 0.014 21805.98 23498 0.013087
>>>> 3 0.3 0.014 15923.83 23169 0.009693
>>>> 4 0.4 0.014 9637.47 22343 0.006083
>>>> 5 0.5 0.014 5527.75 21330 0.003655
>>>> 6 0.6 0.014 3060.73 20221 0.002135
>>>> 7 0.7 0.014 1703.82 19205 0.001251
>>>> 8 0.8 0.014 953.02 18256 0.000736
>>>> 9 0.9 0.014 536.81 17382 0.000436
>>>> 10 1.0 0.014 307.19 16730 0.000259
>>>> 11 1.1 0.014 176 16143 0.000154
>>>> which I think implies that many many genes are differentially
>>>> expressed. Using plot(sam.out, 1.2) shows a pretty much vertical
>>>> line starting at the origin, with no genes observably behaving like
>>>> the null model. Even if I only try this on 2 classes, and hence
>>>> throwing out most of the data, I'm still not getting sensible results.
>>>> Now I'm hoping that I'm doing something wrong, and that not 16K of
>>>> my genes are differentially expressed. However, I'm having
>>>> difficulty figuring out what it might be. The one striking thing
>>>> between my data set and the golub example set is that golub seems to
>>>> be zero-mean - is this a requirement for sam()?
>>>> Any other ideas of what to look for, or what other information I
>>>> could provide to help this question make sense, would be greatly
>>>> appreciated.
>>>> Thanks in advance,
>>>> Mike Dewar
>>>> - - -
>>>> Dr Michael Dewar
>>>> Postdoctoral Research Scientist Applied Mathematics
>>>> Columbia University
>>>> http://www.columbia.edu/~md2954/
>>>> [[alternative HTML version deleted]]
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at stat.math.ethz.ch <mailto:Bioconductor at stat.math.ethz.ch>
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>> --
>>>
>>>
>>> Wolfgang Huber
>>> EMBL
>>> http://www.embl.de/research/units/genome_biology/huber
>>>
>>>
>>>
>>
>> - - -
>> Dr Michael Dewar
>> **Postdoctoral Research Scientist Applied Mathematics
>> Columbia University
>> http://www.columbia.edu/~md2954/
>>
>>
>>
>>
>>
>>
>
>
--
Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber
More information about the Bioconductor
mailing list