[BioC] question subsetting expressionSet

Wed Jan 11 12:51:04 CET 2012

On 01/11/2012 02:07 AM, Hooiveld, Guido wrote:
> Hi,
>
> Just to confirm I am doing things properly:
> I have created an expressionSet from a set of 120 Affymetrix arrays and I also added some metadata (phenoData) to that expressionSet. This all goes OK. Now I would like to subset the expressionSet based on one of the variables described in the phenoData.
> Although I am able to 'extract' the proper arrays, I noticed something unexpected when looking at the phenoData of the new, subset object; the phenoData slot that has been used to subset *seems* to still have 3 levels, whereas I expect only one level. This behaviour also occurs for the other variables of the phenoData dataframe (i.e. more levels are reported than are present). To be sure, can anyone explain if this is to be expected, or whether I do something wrong?

Hi Guido -- this is how R factors work

 > g = f[f=="M"]
 > g
[1] M
Levels: F M

You could recast the factor, e.g.,

> factor(g)
[1] M
Levels: M

or be satisfied that R is keeping track of important aspects of your 
original data.

(in your script below, x.norm2 <- x.norm[,x.norm$Diet=="chow"] might 
have been more natural, rather than making a copy and subsetting the copy).

Hope that helps,

Martin

> Thanks,
> Guido
>
>
> # read data&  normalize
>> library(affyPLM)
>
>> pheno<- read.delim(file="A213_metadata.txt", row.names=1)
>> affy.data<- ReadAffy(cdfname="mogene11stv1mmentrezg", phenoData=as.data.frame(pheno))
>>
>> # check
>> validObject(affy.data)
> [1] TRUE
>>
>> # normalize
>> x.norm<- fitPLM(affy.data)
>> # convert PLMset to eSet!
>> x.norm<- pset2eset(x.norm)
>> #check
>> validObject(x.norm)
> [1] TRUE
>>
>> x.norm
> ExpressionSet (storageMode: lockedEnvironment)
> assayData: 21225 features, 120 samples
>    element names: exprs, se.exprs
> protocolData: none
> phenoData
>    sampleNames: G014_A05_01_801_I1_chow.CEL G014_A07_09_809_I1_HF.CEL
>      ... G020_H12_120_824_I10_HF.CEL (120 total)
>    varLabels: Simil Diet ... Labeling (5 total)
>    varMetadata: labelDescription
> featureData: none
> experimentData: use 'experimentData(object)'
> Annotation: mogene11stv1mmentrezg
>
>> dim(x.norm)
> Features  Samples
>     21225      120
>>
>> #Check Diet assignment
>> x.norm$Diet
>    [1] chow hfd  lfd  chow hfd  lfd  chow hfd  lfd  chow hfd  lfd  lfd  chow hfd
>   [16] lfd  chow hfd  lfd  chow hfd  lfd  chow hfd  chow chow chow chow lfd  lfd
>   [31] lfd  lfd  hfd  hfd  hfd  hfd  chow chow chow chow lfd  lfd  lfd  lfd  hfd
>   [46] hfd  hfd  hfd  chow chow chow chow lfd  lfd  lfd  lfd  hfd  hfd  hfd  hfd
>   [61] chow chow chow chow lfd  lfd  lfd  lfd  hfd  hfd  hfd  hfd  chow chow chow
> [76] chow lfd  lfd  lfd  lfd  hfd  hfd  hfd  hfd  chow chow chow chow lfd  lfd
>   [91] lfd  lfd  hfd  hfd  hfd  hfd  chow chow chow chow lfd  lfd  lfd  lfd  hfd
> [106] hfd  hfd  hfd  chow chow chow chow lfd  lfd  lfd  lfd  hfd  hfd  hfd  hfd
> Levels: chow hfd lfd
>>
>> str(x.norm)
> <<snip>
>    ..@ phenoData        :Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots
>    .. .. ..@ varMetadata      :'data.frame':     5 obs. of  1 variable:
>    .. .. .. ..$ labelDescription: chr [1:5] NA NA NA NA ...
>    .. .. ..@ data             :'data.frame':     120 obs. of  5 variables:
>    .. .. .. ..$ Simil   : Factor w/ 10 levels "i1","i10","i2",..: 1 1 3 1 1 3 1 1 3 1 ...
>    .. .. .. ..$ Diet    : Factor w/ 3 levels "chow","hfd","lfd": 1 2 3 1 2 3 1 2 3 1 ...
>    .. .. .. ..$ Group   : Factor w/ 30 levels "i10_chow","i10_hfd",..: 4 5 9 4 5 9 4 5 9 4 ...
>    .. .. .. ..$ Plate   : Factor w/ 2 levels "G014","G020": 1 1 1 1 1 1 1 1 1 1 ...
>    .. .. .. ..$ Labeling: int [1:120] 3 1 2 1 1 2 1 1 2 1 ...
>
> So far, so good.
> Now I would like to extract data of only the 40 chow samples by subsetting x.norm on variable 'Diet'.
>
>> # backup x.norm
>> x.norm2<- x.norm
>>
>> #subset only chow samples
>> x.norm<- x.norm2[,x.norm2$Diet=="chow"]
>> dim(x.norm)
> Features  Samples
>     21225       40
>
> Subsetting samples seem to go OK...
>> #Again check Diet assigment
>> x.norm$Diet
> [1] chow chow chow chow chow chow chow chow chow chow chow chow chow chow chow
> [16] chow chow chow chow chow chow chow chow chow chow chow chow chow chow chow
> [31] chow chow chow chow chow chow chow chow chow chow
> Levels: chow hfd lfd
>>
>
> ^^ why are there still 3 levels; i expected only one level, namely "chow"
>
>> str(x.norm)
> <<snip>
>    ..@ phenoData        :Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots
>    .. .. ..@ varMetadata      :'data.frame':     5 obs. of  1 variable:
>    .. .. .. ..$ labelDescription: chr [1:5] NA NA NA NA ...
>    .. .. ..@ data             :'data.frame':     40 obs. of  5 variables:
>    .. .. .. ..$ Simil   : Factor w/ 10 levels "i1","i10","i2",..: 1 1 1 1 3 3 3 3 4 4 ...
>    .. .. .. ..$ Diet    : Factor w/ 3 levels "chow","hfd","lfd": 1 1 1 1 1 1 1 1 1 1 ...
>    .. .. .. ..$ Group   : Factor w/ 30 levels "i10_chow","i10_hfd",..: 4 4 4 4 7 7 7 7 10 10 ...
>    .. .. .. ..$ Plate   : Factor w/ 2 levels "G014","G020": 1 1 1 1 1 1 1 1 2 2 ...
>    .. .. .. ..$ Labeling: int [1:40] 3 1 1 1 2 2 2 2 3 3 ...
>
> ^^ idem, why are for all variables the 'original' levels reported and not the subset ones?
>
>
>
>
>> sessionInfo()
> R version 2.14.0 (2011-10-31)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>   [7] LC_PAPER=C                 LC_NAME=C
>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] SpeCond_1.8.0                   RColorBrewer_1.0-5
>   [3] hwriter_1.3                     fields_6.6.3
>   [5] spam_0.27-0                     mclust_3.4.11
>   [7] mogene11stv1mmentrezgcdf_14.1.0 affyPLM_1.30.0
>   [9] preprocessCore_1.16.0           gcrma_2.26.0
> [11] affy_1.33.2                     Biobase_2.14.0
> [13] BiocGenerics_0.1.3
>
> loaded via a namespace (and not attached):
> [1] affyio_1.22.0       BiocInstaller_1.2.1 Biostrings_2.22.0
> [4] IRanges_1.12.5      splines_2.14.0      tools_2.14.0
> [7] zlibbioc_1.0.0
>>
>
>
> ---------------------------------------------------------
> Guido Hooiveld, PhD
> Nutrition, Metabolism&  Genomics Group
> Division of Human Nutrition
> Wageningen University
> Biotechnion, Bomenweg 2
> NL-6703 HD Wageningen
> the Netherlands
> tel: (+)31 317 485788
> fax: (+)31 317 483342
> email:      guido.hooiveld at wur.nl
> internet:   http://nutrigene.4t.com
> http://scholar.google.com/citations?user=qFHaMnoAAAAJ
> http://www.researcherid.com/rid/F-4912-2010
>
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793