[BioC] justRMA (package Affy) messing up phenoData when passing an AnnotatedDataFrame

Wed Apr 2 19:58:37 CEST 2014

Hi Arnaud,

Thanks for the bug report!

This is fixed in the devel version, which will be released in the next 
week or so.

Note that we cannot do much comparisons to ensure that the phenoData 
row.names make sense. In addition, the row.names for the phenoData 
object take precedence over filenames when creating the resulting 
ExpressionSet. Therefore, if we can't do simple matching between the 
row.names of the phenoData object, we just take you at your word that 
these are correctly ordered. Here is an example, using some random data:

 > df
          x    y z
Sample_1 1  Low A
Sample_2 2 High B
Sample_3 3  Low C
Sample_4 4 High D
Sample_5 5  Low E
Sample_6 6 High F
 > list.celfiles()
[1] "GSM38345.CEL.gz" "GSM38346.CEL.gz" "GSM38347.CEL.gz" "GSM38348.CEL.gz"
[5] "GSM38349.CEL.gz" "GSM38350.CEL.gz"
 > eset <- justRMA(phenoData = df)

Warning messages:
1: The cel file names are not conformable to the sample names provided 
in the phenoData object.
Note that the sample names take precedence, so will be used in the 
sampleNames slot of the resulting ExpressionSet.
Please ensure that the ordering of the phenoData object matches the 
order of the file names (use read.celfiles() to test), as no re-ordering 
is possible.

So here, the row.names are Sample_1 through Sample_6, which are 
obviously very different from the GSM numbers for the celfiles. So we 
just assume that the sample names we get from row.names(df) are in the 
same order (and match with) what we get from list.celfiles().

 > sampleNames(eset)
[1] "Sample_1" "Sample_2" "Sample_3" "Sample_4" "Sample_5" "Sample_6"
 > pData(phenoData(eset))
          x    y z
Sample_1 1  Low A
Sample_2 2 High B
Sample_3 3  Low C
Sample_4 4 High D
Sample_5 5  Low E
Sample_6 6 High F

So if the phenoData object is ordered incorrectly (e.g., if for instance 
Sample_1 is actually GSM38346.CEL.gz, or whatever), then the 
ExpressionSet will be borked as well, but there isn't a way to 
programmatically figure that out.

But if the row.names are conformable with the file names, then we read 
in the files in the same order:

 > row.names(df) <- rev(list.celfiles())
 > eset2 <- justRMA(phenoData = df)
 > sampleNames(eset2)
[1] "GSM38350.CEL.gz" "GSM38349.CEL.gz" "GSM38348.CEL.gz" "GSM38347.CEL.gz"
[5] "GSM38346.CEL.gz" "GSM38345.CEL.gz"
 > pData(phenoData(eset2))
                 x    y z
GSM38350.CEL.gz 1  Low A
GSM38349.CEL.gz 2 High B
GSM38348.CEL.gz 3  Low C
GSM38347.CEL.gz 4 High D
GSM38346.CEL.gz 5  Low E
GSM38345.CEL.gz 6 High F
 > all.equal(exprs(eset), exprs(eset2)[,6:1], check.attributes = FALSE)
[1] TRUE

Best,

Jim

On 4/2/2014 11:20 AM, Arnaud wrote:
> Hello to all,
>
> I think there is a problem with a certain call of justRMA (see below); it mixes the sample names of pData but keeps the annotation the same, effectively miss labeling samples without returning a single error or warning.
>
>> phenoData <- read.AnnotatedDataFrame(paste(prodir, "sample_pheno.txt", sep="/"), sep="\t")
>> head(pData(phenoData))
>                                Line        Resistance Media Condition
> 28_MGH006_neg.CEL           MGH006             Naive   D10      Cont
> 30_MGH006_criz.CEL          MGH006             Naive  D10   Criz 300
> 21_MGH010_neg.CEL           MGH010              Criz  R10       Cont
> 15_MGH010_Criz.CEL          MGH010              Criz  R10   Criz 300
> 9_MGH021_neg.CEL   MGH021-2 (pool) Criz (1269 et al)   A10      Cont
> 4_MGH021_LDK.CEL   MGH021-2 (pool) Criz (1269 et al)   A10   LDK 300
>
>
>> eset <- justRMA(phenoData=phenoData, celfile.path=celdir)
>> head(pData(eset))
>                                     Line        Resistance Media Condition
> 10_MGH025_neg.CEL                MGH006             Naive   D10      Cont
> 11_MGH022_neg.CEL                MGH006             Naive  D10   Criz 300
> 12_MGH_049_criz.CEL              MGH010              Criz  R10       Cont
> 13_MGH025_criz_0530.CEL          MGH010              Criz  R10   Criz 300
> 14_MGH025_LDK.CEL       MGH021-2 (pool) Criz (1269 et al)   A10      Cont
> 15_MGH010_Criz.CEL      MGH021-2 (pool) Criz (1269 et al)   A10   LDK 300
>
> ## IN CONTRAST, THIS WORKS CORRECTLY:
> eset <- justRMA(phenoData=paste(prodir, "sample_pheno.txt", sep="/"), celfile.path=celdir)
>
>
> I hope people appreciate this might lead to completely wrong results, and that it should be addressed in some way by the authors ?
>
> Best regards,
>
> Arnaud Amzallag, PhD
> Center for Cancer Research, Massachusetts General Hospital/Harvard Medical School
>
>> sessionInfo()
> R version 3.0.1 (2013-05-16)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] C
>
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods
> [8] base
>
> other attached packages:
> [1] hgu133plus2cdf_2.12.0 AnnotationDbi_1.22.6  affy_1.38.1
> [4] Biobase_2.20.1        BiocGenerics_0.6.0    rj_1.1.3-1
>
> loaded via a namespace (and not attached):
>   [1] BiocInstaller_1.10.4  DBI_0.2-7             IRanges_1.18.4
>   [4] RSQLite_0.11.4        affyio_1.28.0         preprocessCore_1.22.0
>   [7] rj.gd_1.1.3-1         stats4_3.0.1          tools_3.0.1
> [10] zlibbioc_1.6.0
>   
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099