[BioC] normalisation + sam

Fri Apr 23 20:47:08 CEST 2010

Hi Mike

I added another heatmap, not using your 97 "marker" genes, but 200 
random ones instead. The clustering result for the samples is the same:
	 http://www-huber.embl.de/users/whuber/bioc-list/100423

Also, your fData(immgen) is ill-formatted. It should be a data.frame 
with as many rows as features, instead it has as many columns, with one row:
 > dim(immgen)
Features  Samples
    23532      128
 > dim(fData(immgen))
[1] 23532     1

	Best wishes
	Wolfgang


Wolfgang Huber ha scritto:
> 
> 
> Hi Mike
> 
> I am afraid it does look like a normalisation problem. siggenes is 
> right, given the data. Have a look at the heatmap in the PDF file at
>   http://www-huber.embl.de/users/whuber/bioc-list/100423/
> and the R script in the same directory, which derives from your code below.
> 
>     Best wishes
>     Wolfgang
> 
> 
> Mike Dewar scripsit 23/04/10 17:11:
>> Hi Wolfgang,
>>
>> To get from the CEL files to a Bioconductor ExpressionSet is quite 
>> longwinded and, as it's not using Bioconductor, maybe not something I 
>> should abuse this list with. Instead, I've put the expression set I've 
>> generated up online at
>> http://www.columbia.edu/~md2954/immgen.data
>>
>> Then, to generate the results I'm seeing, I use the code below. I'm 
>> pretty new to R, too, so any general hints also welcome. In the code 
>> below I'm interested in seeing which genes that have symbols like 
>> "CD4" are differentially expressed. Note that if I just use all the 
>> genes (and cut out the code that selects "^Cd\\d+" genes) then I still 
>> get results implying that my genes are not distributed in the way sam 
>> is expecting.
>> To precisize my question then, does the result of SAM, as expressed in 
>> the below code, imply that my array data is badly normalised? If not, 
>> does it imply some horrific misunderstanding on my part on how these 
>> things should work?
>>
>> Mike
>>
>> library(siggenes)
>> library(Biobase)
>> # PUT THE FOLDER WHERE YOU STORED immgen.data, i.e.
>> #path = "/Users/mike/Data/Immgen/userData/GSE15907"
>> path = "."
>> # USE THE SEPARATOR APPROPRIATE FOR YOUR SYSTEM (must learn more about 
>> this)
>> sep = "/"
>> # load the data
>> filename = "immgen.data"
>> load(paste(path,filename,sep=sep)) # loads a ExpressionSet called immgen
>> # form a vector which corresponds to the class of each array
>> p = pData(immgen)
>> cl = colnames(p)[apply(p, 1, which)]
>> # remove "phenotype" genes
>> # any gene that directly encodes a surface marker is highlighted
>> regexp = "^Cd\\d+"
>> marker_indices = grep(regexp,fData(immgen)$symbol)
>> markers = immgen[marker_indices,]
>> # class labels:
>> classes <- colnames(pData(markers))
>> # find out how many of each class we have
>> count <- lapply(pData(markers),sum)
>> # chuck out all the experiments with less than 4 examples. This will 
>> leave just two classes. to_keep = as.logical(sapply(
>> as.data.frame(t(
>> # the next line chooses those phenotypes with more than n examples
>> # set n to 2 to leave 28 classes n = 3
>> pData(markers)[,classes[count>n]] )),
>> sum # the sum is for each row, hence the transposition above
>> ))
>> # now to_keep has a 1 next to each array we want and a 0 to those we 
>> don't
>> # hence we can use to_keep to index the markers expression set
>> markers <- markers[,to_keep]
>> # and then chuck out classes we've rejected
>> cl <- cl[to_keep]
>> # then apply sam to the markers
>> sam.out = sam(data=markers,cl=cl,var.equal=FALSE)
>> print(sam.out, seq(0.1, 5, 0.1))
>> plot(sam.out, 0.5)
>>
>>
>>
>> On 22 Apr 2010, at 15:41, Wolfgang Huber wrote:
>>
>>> Hi Mike
>>>
>>> can you provide a reproducible example (R script) and the output of 
>>> sessionInfo() for the 2-groups comparison? This would include a 
>>> pointer to the 6(?) CEL files on the web.
>>>
>>> For the 28-classes case, I doubt that hypothesis testing of the null 
>>> hypothesis "mean in all classes is the same" is the most useful data 
>>> analytic approach. Perhaps you can precisize your question.
>>>
>>> Best wishes
>>> Wolfgang
>>>
>>>> Hi,
>>>> I'm trying to use siggenes - sam() to look at differentially 
>>>> expressed genes of data taken from the Immunological Genome project 
>>>> (http://www.immgen.org/). A problem with this is that I have to 
>>>> perform the preprocessing of the original CEL files as they do not 
>>>> make the processed data available on GEO. To do this I'm using the 
>>>> aroma suite of packages to perform quantile normalisation of this 
>>>> data set (so far 128 arrays) in fixed memory (i.e. my laptop). This 
>>>> is a good thing, as it has forced me to learn a little about array 
>>>> preprocessing, and a bad thing as I'm new to all this and might be 
>>>> going horribly wrong. When it comes time to look for differentially 
>>>> expressed genes, I'm using siggenes - sam() and I'm getting some 
>>>> strange results. I'm using (what I think would be considered) many 
>>>> classes (28), where each class has at least 3 examples, and thus 
>>>> throwing out some of the arrays. The results I'm getting look like:
>>>> SAM Analysis for the Multi-Class Case with 28 Classes
>>>>   Delta    p0    False Called      FDR
>>>> 1    0.1 0.014 23355.05  23532 0.013997
>>>> 2    0.2 0.014 21805.98  23498 0.013087
>>>> 3    0.3 0.014 15923.83  23169 0.009693
>>>> 4    0.4 0.014  9637.47  22343 0.006083
>>>> 5    0.5 0.014  5527.75  21330 0.003655
>>>> 6    0.6 0.014  3060.73  20221 0.002135
>>>> 7    0.7 0.014  1703.82  19205 0.001251
>>>> 8    0.8 0.014   953.02  18256 0.000736
>>>> 9    0.9 0.014   536.81  17382 0.000436
>>>> 10   1.0 0.014   307.19  16730 0.000259
>>>> 11   1.1 0.014      176  16143 0.000154
>>>> which I think implies that many many genes are differentially 
>>>> expressed. Using plot(sam.out, 1.2) shows a pretty much vertical 
>>>> line starting at the origin, with no genes observably behaving like 
>>>> the null model. Even if I only try this on 2 classes, and hence 
>>>> throwing out most of the data, I'm still not getting sensible results.
>>>> Now I'm hoping that I'm doing something wrong, and that not 16K of 
>>>> my genes are differentially expressed. However, I'm having 
>>>> difficulty figuring out what it might be. The one striking thing 
>>>> between my data set and the golub example set is that golub seems to 
>>>> be zero-mean - is this a requirement for sam()?
>>>> Any other ideas of what to look for, or what other information I 
>>>> could provide to help this question make sense, would be greatly 
>>>> appreciated.
>>>> Thanks in advance,
>>>> Mike Dewar
>>>> - - -
>>>> Dr Michael Dewar
>>>> Postdoctoral Research Scientist Applied Mathematics
>>>> Columbia University
>>>> http://www.columbia.edu/~md2954/
>>>> [[alternative HTML version deleted]]
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at stat.math.ethz.ch <mailto:Bioconductor at stat.math.ethz.ch>
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives: 
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>> -- 
>>>
>>>
>>> Wolfgang Huber
>>> EMBL
>>> http://www.embl.de/research/units/genome_biology/huber
>>>
>>>
>>>
>>
>> - - -
>> Dr Michael Dewar
>> **Postdoctoral Research Scientist Applied Mathematics
>> Columbia University
>> http://www.columbia.edu/~md2954/
>>
>>
>>
>>
>>
>>
> 
> 

-- 


Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber