[BioC] nsFilter and GSEA

Fri Jan 11 19:44:09 CET 2008

Dear Robert and BioC Mailing list,

The chips are Affymetrix Drosophila genome 1.0 (annotation drosgenome1).
I am even more confused: to make sure that was not my fault, I copied 
the .CEL files in a new directory, started a fresh R session from there 
and run *just* the following code. Same results:

 > library(affy)
Loading required package: Biobase
Loading required package: tools

Welcome to Bioconductor

   Vignettes contain introductory material. To view, type
   'openVignette()'. To cite Bioconductor, see
   'citation("Biobase")' and for packages 'citation(pkgname)'.

Loading required package: affyio
Loading required package: preprocessCore
 > mydata <- ReadAffy()
 > eset.rma <- rma(mydata)
Background correcting
Normalizing
Calculating Expression
 > eset.mas <- mas5(mydata)
background correction: mas
PM/MM correction : mas
expression values: mas
background correcting...done.
14010 ids to be processed
|                    |
|####################|
 > library(genefilter)
Loading required package: survival
Loading required package: splines
 > eset.rma.f <- nsFilter(eset.rma)
 > eset.mas.f <- nsFilter(eset.mas)
 > eset.rma.f
$eset
ExpressionSet (storageMode: lockedEnvironment)
assayData: 171 features, 15 samples
   element names: exprs
phenoData
   sampleNames: dta_2a.CEL, dta_2b.CEL, ..., virgin_4b.CEL  (15 total)
   varLabels and varMetadata description:
     sample: arbitrary numbering
featureData
   featureNames: 147260_at, 142359_at, ..., 145988_at  (171 total)
   fvarLabels and fvarMetadata description: none
experimentData: use 'experimentData(object)'
Annotation: drosgenome1

$filter.log
$filter.log$numDupsRemoved
[1] 3

$filter.log$numLowVar
[1] 13047

$filter.log$feature.exclude
[1] 3

$filter.log$numRemoved.ENTREZID
[1] 786

 > eset.mas.f
$eset
ExpressionSet (storageMode: lockedEnvironment)
assayData: 12122 features, 15 samples
   element names: exprs, se.exprs
phenoData
   sampleNames: dta_2a.CEL, dta_2b.CEL, ..., virgin_4b.CEL  (15 total)
   varLabels and varMetadata description:
     sample: arbitrary numbering
featureData
   featureNames: 153135_at, 154994_at, ..., 152360_at  (12122 total)
   fvarLabels and fvarMetadata description: none
experimentData: use 'experimentData(object)'
Annotation: drosgenome1

$filter.log
$filter.log$numDupsRemoved
[1] 1098

$filter.log$numLowVar
[1] 1

$filter.log$feature.exclude
[1] 3

$filter.log$numRemoved.ENTREZID
[1] 786

 > sessionInfo()
R version 2.6.1 (2007-11-26)
i486-pc-linux-gnu

locale:
LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C

attached base packages:
[1] splines   tools     stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] drosgenome1_2.0.1    genefilter_1.16.0    survival_2.34
[4] drosgenome1cdf_2.0.0 affy_1.16.0          preprocessCore_1.0.0
[7] affyio_1.6.1         Biobase_1.16.1

loaded via a namespace (and not attached):
[1] annotate_1.16.1     AnnotationDbi_1.0.6 DBI_0.2-4
[4] rcompgen_0.1-17     RSQLite_0.6-4
 >

Could be the CEL files that are damaged?
Thanks,
best wishes,
Paolo

Robert Gentleman wrote:
> Hi,
>  It looks like something fairly odd is going on, and that we are not 
> seeing all of the code that is being run.
> 
>  What chip are you using?  What is very odd is that in your first 
> example 1098 "duplicate" probes are found, but in the second run only 3. 
> Basically this cannot happen (since the probes are the same) and 
> suggests that some piece of code has manipulated the names, and at that 
> point I think fairly bad things are going to happen. So this would be 
> one place to try and fix things.
> 
>  Second, nsFilter filters by default at the median, so you should retain 
> about 0.5 of your probe sets. But since you loose so many (you didn't 
> tell us the chip so I can't be sure) but it looks like all of the values 
> are corrupt for that example as well.
> 
>  So, I think that you are looking in the wrong place. Your problem is 
> probably earlier on.
> 
>  best wishes
>    Robert
> 
> 
> Paolo Innocenti wrote:
>> Hi again,
>>
>> I tried with a different normalisation method, and I was pretty 
>> surprised by the results:
>>
>>  > eset.mas <- mas5(mydata)
>> background correction: mas
>> PM/MM correction : mas
>> expression values: mas
>> background correcting...done.
>> 14010 ids to be processed
>> |                    |
>> |####################|
>>  > eset.mas.f <- nsFilter(eset.mas)
>>  > eset.mas.f$filter.log
>> $numDupsRemoved
>> [1] 1098
>>
>> $numLowVar
>> [1] 1
>>
>> $feature.exclude
>> [1] 3
>>
>> $numRemoved.ENTREZID
>> [1] 786
>>
>>  > eset.rma <- rma(mydata)
>> Background correcting
>> Normalizing
>> Calculating Expression
>>  > eset.rma.f <- nsFilter(eset.rma)
>>  > eset.rma.f$filter.log
>> $numDupsRemoved
>> [1] 3
>>
>> $numLowVar
>> [1] 13047
>>
>> $feature.exclude
>> [1] 3
>>
>> $numRemoved.ENTREZID
>> [1] 786
>>
>>  > dim(eset.rma.f$eset)
>> Features  Samples
>>       171       15
>>  > dim(eset.mas.f$eset)
>> Features  Samples
>>     12122       15
>>
>> I don't understand how is it possible. Any suggestion about what to 
>> do? Should I lower the cutoff for the rma, or that processing method 
>> doesn't work for my dataset?
>>
>> Paolo
>> PS: I tried also a really low cutoff, but the situation doesn't 
>> change, unless I choose a cutoff=0.1:
>>
>>  > eset.filter <- nsFilter(eset,var.cutoff=0.2)
>>  > eset.filter$filter.log
>> $numDupsRemoved
>> [1] 69
>>
>> $numLowVar
>> [1] 10560
>>
>> $feature.exclude
>> [1] 3
>>
>> $numRemoved.ENTREZID
>> [1] 786
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: 
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>