[BioC] justRMA (package Affy) messing up phenoData when passing an AnnotatedDataFrame
James W. MacDonald
jmacdon at uw.edu
Wed Apr 2 19:58:37 CEST 2014
Hi Arnaud,
Thanks for the bug report!
This is fixed in the devel version, which will be released in the next
week or so.
Note that we cannot do much comparisons to ensure that the phenoData
row.names make sense. In addition, the row.names for the phenoData
object take precedence over filenames when creating the resulting
ExpressionSet. Therefore, if we can't do simple matching between the
row.names of the phenoData object, we just take you at your word that
these are correctly ordered. Here is an example, using some random data:
> df
x y z
Sample_1 1 Low A
Sample_2 2 High B
Sample_3 3 Low C
Sample_4 4 High D
Sample_5 5 Low E
Sample_6 6 High F
> list.celfiles()
[1] "GSM38345.CEL.gz" "GSM38346.CEL.gz" "GSM38347.CEL.gz" "GSM38348.CEL.gz"
[5] "GSM38349.CEL.gz" "GSM38350.CEL.gz"
> eset <- justRMA(phenoData = df)
Warning messages:
1: The cel file names are not conformable to the sample names provided
in the phenoData object.
Note that the sample names take precedence, so will be used in the
sampleNames slot of the resulting ExpressionSet.
Please ensure that the ordering of the phenoData object matches the
order of the file names (use read.celfiles() to test), as no re-ordering
is possible.
So here, the row.names are Sample_1 through Sample_6, which are
obviously very different from the GSM numbers for the celfiles. So we
just assume that the sample names we get from row.names(df) are in the
same order (and match with) what we get from list.celfiles().
> sampleNames(eset)
[1] "Sample_1" "Sample_2" "Sample_3" "Sample_4" "Sample_5" "Sample_6"
> pData(phenoData(eset))
x y z
Sample_1 1 Low A
Sample_2 2 High B
Sample_3 3 Low C
Sample_4 4 High D
Sample_5 5 Low E
Sample_6 6 High F
So if the phenoData object is ordered incorrectly (e.g., if for instance
Sample_1 is actually GSM38346.CEL.gz, or whatever), then the
ExpressionSet will be borked as well, but there isn't a way to
programmatically figure that out.
But if the row.names are conformable with the file names, then we read
in the files in the same order:
> row.names(df) <- rev(list.celfiles())
> eset2 <- justRMA(phenoData = df)
> sampleNames(eset2)
[1] "GSM38350.CEL.gz" "GSM38349.CEL.gz" "GSM38348.CEL.gz" "GSM38347.CEL.gz"
[5] "GSM38346.CEL.gz" "GSM38345.CEL.gz"
> pData(phenoData(eset2))
x y z
GSM38350.CEL.gz 1 Low A
GSM38349.CEL.gz 2 High B
GSM38348.CEL.gz 3 Low C
GSM38347.CEL.gz 4 High D
GSM38346.CEL.gz 5 Low E
GSM38345.CEL.gz 6 High F
> all.equal(exprs(eset), exprs(eset2)[,6:1], check.attributes = FALSE)
[1] TRUE
Best,
Jim
On 4/2/2014 11:20 AM, Arnaud wrote:
> Hello to all,
>
> I think there is a problem with a certain call of justRMA (see below); it mixes the sample names of pData but keeps the annotation the same, effectively miss labeling samples without returning a single error or warning.
>
>> phenoData <- read.AnnotatedDataFrame(paste(prodir, "sample_pheno.txt", sep="/"), sep="\t")
>> head(pData(phenoData))
> Line Resistance Media Condition
> 28_MGH006_neg.CEL MGH006 Naive D10 Cont
> 30_MGH006_criz.CEL MGH006 Naive D10 Criz 300
> 21_MGH010_neg.CEL MGH010 Criz R10 Cont
> 15_MGH010_Criz.CEL MGH010 Criz R10 Criz 300
> 9_MGH021_neg.CEL MGH021-2 (pool) Criz (1269 et al) A10 Cont
> 4_MGH021_LDK.CEL MGH021-2 (pool) Criz (1269 et al) A10 LDK 300
>
>
>> eset <- justRMA(phenoData=phenoData, celfile.path=celdir)
>> head(pData(eset))
> Line Resistance Media Condition
> 10_MGH025_neg.CEL MGH006 Naive D10 Cont
> 11_MGH022_neg.CEL MGH006 Naive D10 Criz 300
> 12_MGH_049_criz.CEL MGH010 Criz R10 Cont
> 13_MGH025_criz_0530.CEL MGH010 Criz R10 Criz 300
> 14_MGH025_LDK.CEL MGH021-2 (pool) Criz (1269 et al) A10 Cont
> 15_MGH010_Criz.CEL MGH021-2 (pool) Criz (1269 et al) A10 LDK 300
>
> ## IN CONTRAST, THIS WORKS CORRECTLY:
> eset <- justRMA(phenoData=paste(prodir, "sample_pheno.txt", sep="/"), celfile.path=celdir)
>
>
> I hope people appreciate this might lead to completely wrong results, and that it should be addressed in some way by the authors ?
>
> Best regards,
>
> Arnaud Amzallag, PhD
> Center for Cancer Research, Massachusetts General Hospital/Harvard Medical School
>
>> sessionInfo()
> R version 3.0.1 (2013-05-16)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] C
>
> attached base packages:
> [1] parallel stats graphics grDevices utils datasets methods
> [8] base
>
> other attached packages:
> [1] hgu133plus2cdf_2.12.0 AnnotationDbi_1.22.6 affy_1.38.1
> [4] Biobase_2.20.1 BiocGenerics_0.6.0 rj_1.1.3-1
>
> loaded via a namespace (and not attached):
> [1] BiocInstaller_1.10.4 DBI_0.2-7 IRanges_1.18.4
> [4] RSQLite_0.11.4 affyio_1.28.0 preprocessCore_1.22.0
> [7] rj.gd_1.1.3-1 stats4_3.0.1 tools_3.0.1
> [10] zlibbioc_1.6.0
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
More information about the Bioconductor
mailing list