[BioC] question subsetting expressionSet
Martin Morgan
mtmorgan at fhcrc.org
Wed Jan 11 12:51:04 CET 2012
On 01/11/2012 02:07 AM, Hooiveld, Guido wrote:
> Hi,
>
> Just to confirm I am doing things properly:
> I have created an expressionSet from a set of 120 Affymetrix arrays and I also added some metadata (phenoData) to that expressionSet. This all goes OK. Now I would like to subset the expressionSet based on one of the variables described in the phenoData.
> Although I am able to 'extract' the proper arrays, I noticed something unexpected when looking at the phenoData of the new, subset object; the phenoData slot that has been used to subset *seems* to still have 3 levels, whereas I expect only one level. This behaviour also occurs for the other variables of the phenoData dataframe (i.e. more levels are reported than are present). To be sure, can anyone explain if this is to be expected, or whether I do something wrong?
Hi Guido -- this is how R factors work
> g = f[f=="M"]
> g
[1] M
Levels: F M
You could recast the factor, e.g.,
> factor(g)
[1] M
Levels: M
or be satisfied that R is keeping track of important aspects of your
original data.
(in your script below, x.norm2 <- x.norm[,x.norm$Diet=="chow"] might
have been more natural, rather than making a copy and subsetting the copy).
Hope that helps,
Martin
> Thanks,
> Guido
>
>
> # read data& normalize
>> library(affyPLM)
>
>> pheno<- read.delim(file="A213_metadata.txt", row.names=1)
>> affy.data<- ReadAffy(cdfname="mogene11stv1mmentrezg", phenoData=as.data.frame(pheno))
>>
>> # check
>> validObject(affy.data)
> [1] TRUE
>>
>> # normalize
>> x.norm<- fitPLM(affy.data)
>> # convert PLMset to eSet!
>> x.norm<- pset2eset(x.norm)
>> #check
>> validObject(x.norm)
> [1] TRUE
>>
>> x.norm
> ExpressionSet (storageMode: lockedEnvironment)
> assayData: 21225 features, 120 samples
> element names: exprs, se.exprs
> protocolData: none
> phenoData
> sampleNames: G014_A05_01_801_I1_chow.CEL G014_A07_09_809_I1_HF.CEL
> ... G020_H12_120_824_I10_HF.CEL (120 total)
> varLabels: Simil Diet ... Labeling (5 total)
> varMetadata: labelDescription
> featureData: none
> experimentData: use 'experimentData(object)'
> Annotation: mogene11stv1mmentrezg
>
>> dim(x.norm)
> Features Samples
> 21225 120
>>
>> #Check Diet assignment
>> x.norm$Diet
> [1] chow hfd lfd chow hfd lfd chow hfd lfd chow hfd lfd lfd chow hfd
> [16] lfd chow hfd lfd chow hfd lfd chow hfd chow chow chow chow lfd lfd
> [31] lfd lfd hfd hfd hfd hfd chow chow chow chow lfd lfd lfd lfd hfd
> [46] hfd hfd hfd chow chow chow chow lfd lfd lfd lfd hfd hfd hfd hfd
> [61] chow chow chow chow lfd lfd lfd lfd hfd hfd hfd hfd chow chow chow
> [76] chow lfd lfd lfd lfd hfd hfd hfd hfd chow chow chow chow lfd lfd
> [91] lfd lfd hfd hfd hfd hfd chow chow chow chow lfd lfd lfd lfd hfd
> [106] hfd hfd hfd chow chow chow chow lfd lfd lfd lfd hfd hfd hfd hfd
> Levels: chow hfd lfd
>>
>> str(x.norm)
> <<snip>
> ..@ phenoData :Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots
> .. .. ..@ varMetadata :'data.frame': 5 obs. of 1 variable:
> .. .. .. ..$ labelDescription: chr [1:5] NA NA NA NA ...
> .. .. ..@ data :'data.frame': 120 obs. of 5 variables:
> .. .. .. ..$ Simil : Factor w/ 10 levels "i1","i10","i2",..: 1 1 3 1 1 3 1 1 3 1 ...
> .. .. .. ..$ Diet : Factor w/ 3 levels "chow","hfd","lfd": 1 2 3 1 2 3 1 2 3 1 ...
> .. .. .. ..$ Group : Factor w/ 30 levels "i10_chow","i10_hfd",..: 4 5 9 4 5 9 4 5 9 4 ...
> .. .. .. ..$ Plate : Factor w/ 2 levels "G014","G020": 1 1 1 1 1 1 1 1 1 1 ...
> .. .. .. ..$ Labeling: int [1:120] 3 1 2 1 1 2 1 1 2 1 ...
>
> So far, so good.
> Now I would like to extract data of only the 40 chow samples by subsetting x.norm on variable 'Diet'.
>
>> # backup x.norm
>> x.norm2<- x.norm
>>
>> #subset only chow samples
>> x.norm<- x.norm2[,x.norm2$Diet=="chow"]
>> dim(x.norm)
> Features Samples
> 21225 40
>
> Subsetting samples seem to go OK...
>> #Again check Diet assigment
>> x.norm$Diet
> [1] chow chow chow chow chow chow chow chow chow chow chow chow chow chow chow
> [16] chow chow chow chow chow chow chow chow chow chow chow chow chow chow chow
> [31] chow chow chow chow chow chow chow chow chow chow
> Levels: chow hfd lfd
>>
>
> ^^ why are there still 3 levels; i expected only one level, namely "chow"
>
>> str(x.norm)
> <<snip>
> ..@ phenoData :Formal class 'AnnotatedDataFrame' [package "Biobase"] with 4 slots
> .. .. ..@ varMetadata :'data.frame': 5 obs. of 1 variable:
> .. .. .. ..$ labelDescription: chr [1:5] NA NA NA NA ...
> .. .. ..@ data :'data.frame': 40 obs. of 5 variables:
> .. .. .. ..$ Simil : Factor w/ 10 levels "i1","i10","i2",..: 1 1 1 1 3 3 3 3 4 4 ...
> .. .. .. ..$ Diet : Factor w/ 3 levels "chow","hfd","lfd": 1 1 1 1 1 1 1 1 1 1 ...
> .. .. .. ..$ Group : Factor w/ 30 levels "i10_chow","i10_hfd",..: 4 4 4 4 7 7 7 7 10 10 ...
> .. .. .. ..$ Plate : Factor w/ 2 levels "G014","G020": 1 1 1 1 1 1 1 1 2 2 ...
> .. .. .. ..$ Labeling: int [1:40] 3 1 1 1 2 2 2 2 3 3 ...
>
> ^^ idem, why are for all variables the 'original' levels reported and not the subset ones?
>
>
>
>
>> sessionInfo()
> R version 2.14.0 (2011-10-31)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=C LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] SpeCond_1.8.0 RColorBrewer_1.0-5
> [3] hwriter_1.3 fields_6.6.3
> [5] spam_0.27-0 mclust_3.4.11
> [7] mogene11stv1mmentrezgcdf_14.1.0 affyPLM_1.30.0
> [9] preprocessCore_1.16.0 gcrma_2.26.0
> [11] affy_1.33.2 Biobase_2.14.0
> [13] BiocGenerics_0.1.3
>
> loaded via a namespace (and not attached):
> [1] affyio_1.22.0 BiocInstaller_1.2.1 Biostrings_2.22.0
> [4] IRanges_1.12.5 splines_2.14.0 tools_2.14.0
> [7] zlibbioc_1.0.0
>>
>
>
> ---------------------------------------------------------
> Guido Hooiveld, PhD
> Nutrition, Metabolism& Genomics Group
> Division of Human Nutrition
> Wageningen University
> Biotechnion, Bomenweg 2
> NL-6703 HD Wageningen
> the Netherlands
> tel: (+)31 317 485788
> fax: (+)31 317 483342
> email: guido.hooiveld at wur.nl
> internet: http://nutrigene.4t.com
> http://scholar.google.com/citations?user=qFHaMnoAAAAJ
> http://www.researcherid.com/rid/F-4912-2010
>
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
Location: M1-B861
Telephone: 206 667-2793
More information about the Bioconductor
mailing list