[BioC] WARNING: difference in sorting order depending on computer platform?!?

Jenny Drnevich drnevich at illinois.edu
Thu Jan 28 21:16:24 CET 2010


Hi all,

I just found a problem/discrepancy in running R on a PC vs. a Unix/Linux
server. Maybe it's widely known, but I didn't know about it and it 
caused me big problems. I mostly use my desktop PC for running 
microarray analyses, but occasionally I have projects that require 
more memory. Then I run some of the memory-intensive steps on our 
Linux server (which has a lot more memory but is REALLY slow), save
the objects, and go back to my PC to finish the analysis. Well, it 
turns out that the order of probe set IDs as returned by 
featureNames() is slightly different between the computer platforms. 
I first thought it might be due to a difference in the chipnamecdf
library Windows binary vs. *nix compilation of the source file, but I 
think it's just a difference in the way the computer platforms sort 
character data that have numbers. I've put a full, reproducible 
example below (our sys admin hasn't upgraded R on the server yet, but 
I doubt that's the problem), but in short, my PC puts 177_at before 
1773_at, but the server puts 1773_at before 177_at.
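
The collation difference described above can be reproduced without any
Bioconductor packages. The following is only a minimal sketch: it assumes
that sort() on character data follows the LC_COLLATE setting of the
session, and it temporarily switches to the "C" locale (plain byte order)
to get a result that should not depend on the platform. The variable
names are illustrative, not from the original analysis.

```r
# Two probe set IDs taken from the example in this post.
ids <- c("177_at", "1773_at")

# Collation rules in effect for this session; this differs between
# the Windows PC and the Linux server in the transcript below.
Sys.getlocale("LC_COLLATE")

# Locale-dependent result: may be either order, depending on platform.
sort(ids)

# Temporarily force byte-wise ("C") collation, then restore the old
# setting so the rest of the session is unaffected.
old_collate <- Sys.getlocale("LC_COLLATE")
invisible(Sys.setlocale("LC_COLLATE", "C"))
sorted_c <- sort(ids)
invisible(Sys.setlocale("LC_COLLATE", old_collate))

# In byte order "3" (0x33) sorts before "_" (0x5F), so 1773_at is first.
sorted_c
```

In R versions much newer than the ones used in this post,
sort(ids, method = "radix") is documented to use C-locale collation for
character vectors and avoids changing the session locale at all.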

I guess this really isn't a "bug" that can be fixed, and I know it's 
not a good idea to run part of your R code on one computer and part 
on another computer, but don't you agree that this is undesirable 
behavior?  Maybe I'm not computer-literate enough to have known that 
this is a well-known issue, so in part I'm posting this as a warning 
to others like me - I don't remember seeing anything like this in the 
4+ years I've been following the BioC list. I'm also wondering: besides
however many of my own analyses may have been messed up slightly
(ARRRGGHH!!), could this cause problems in things like public
repositories? I know databases don't depend on order, but
I'd be surprised if it hasn't caused problems somewhere else. In this 
case, there are only 117 probe sets out of 22,277 that don't match up,
so it would be hard to notice!
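
One way to guard against this kind of silent misalignment is to line
objects up by probe set ID rather than by position. The sketch below uses
a four-ID subset of the mismatches shown in the transcript; the numeric
values are made up purely for illustration, and the variable names are
hypothetical.

```r
# Same IDs in the two orders seen on the PC and on the Linux server.
pc_ids    <- c("177_at", "1773_at", "2028_s_at", "202800_at")
linux_ids <- c("1773_at", "177_at", "202800_at", "2028_s_at")

# Hypothetical per-probe-set values computed on the server, in server order.
linux_values <- c(10.1, 7.4, 5.5, 8.2)
names(linux_values) <- linux_ids

# Positions of each PC ID within the Linux ordering.
idx <- match(pc_ids, linux_ids)

# Stop early if any ID is missing instead of silently introducing NAs.
stopifnot(!any(is.na(idx)))

# Reorder the server results so they line up with the PC ordering.
pc_values <- linux_values[idx]

# Confirm the alignment by name before combining anything by position.
identical(names(pc_values), pc_ids)
```

The explicit identical() check on the names is cheap, and it would have
flagged the 117 mismatches immediately.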

Thanks,
Jenny


 > library(affy)
Loading required package: Biobase

Welcome to Bioconductor

   Vignettes contain introductory material. To view, type
   'openVignette()'. To cite Bioconductor, see
   'citation("Biobase")' and for packages 'citation(pkgname)'.

 > library(ArrayExpress)
 >
 > rawset = ArrayExpress("E-MEXP-1422")
trying URL 'http://www.ebi.ac.uk/microarray-as/ae/files/E-MEXP-1422/index.html'
Content type 'text/html;charset=ISO-8859-1' length unknown
opened URL
downloaded 7746 bytes

trying URL 
'http://www.ebi.ac.uk/microarray-as/ae/files/E-MEXP-1422/E-MEXP-1422.raw.1.zip'
Content type 'application/zip' length 11200346 bytes (10.7 Mb)
opened URL
downloaded 10.7 Mb

Read 1 item
trying URL 
'http://www.ebi.ac.uk/microarray-as/ae/files/E-MEXP-1422/E-MEXP-1422.sdrf.txt'
Content type 'text/plain' length 6679 bytes
opened URL
downloaded 6679 bytes

trying URL 
'http://www.ebi.ac.uk/microarray-as/ae/files/A-AFFY-37/A-AFFY-37.adf.txt'
Content type 'text/plain' length 3590863 bytes (3.4 Mb)
opened URL
downloaded 3.4 Mb

trying URL 
'http://www.ebi.ac.uk/microarray-as/ae/files/E-MEXP-1422/E-MEXP-1422.idf.txt'
Content type 'text/plain' length 5378 bytes
opened URL
downloaded 5378 bytes

Read 49 items

  The object containing experiment  E-MEXP-1422  has been built.

 > rawset
AffyBatch object
size of arrays=732x732 features (8499 kb)
cdf=HG-U133A_2 (22277 affyids)
number of samples=6
number of genes=22277
annotation=hgu133a2
notes=E-MEXP-1422
         E-MEXP-1422
         RNAi
         c("cellular_modification_design", "co-expression_design", 
"in_vitro_design", "RNAi")
         NULL
 >
 > PSnames.PC <- featureNames(rawset)
 >
 > all.equal(PSnames.PC, featureNames(rawset))
[1] TRUE
 >
 > save.image("NameOrderTest.RData")
 >
 > sessionInfo()
R version 2.10.1 (2009-12-14)
i386-pc-mingw32

locale:
[1] LC_COLLATE=English_United States.1252
[2] LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] hgu133a2cdf_2.5.0  ArrayExpress_1.6.1 affy_1.24.2        Biobase_2.6.1

loaded via a namespace (and not attached):
[1] affyio_1.14.0        limma_3.2.1          preprocessCore_1.8.0
[4] tools_2.10.1         XML_2.6-0
 >
 > q()


# now move to Linux server:


 > library(affy)
Loading required package: Biobase

Welcome to Bioconductor

   Vignettes contain introductory material. To view, type
   'openVignette()'. To cite Bioconductor, see
   'citation("Biobase")' and for packages 'citation(pkgname)'.

 >
 >
 >
 > load("NameOrderTest.RData")
 >
 >
 >
 > all.equal(PSnames.PC, featureNames(rawset))
[1] "117 string mismatches"
 >
 >
 > x <- data.frame(PC=PSnames.PC, Linux=featureNames(rawset), 
stringsAsFactors=F)
 >
 > x[ x[,1] != x[,2] , ][ 1:5 , ]
             PC     Linux
17      177_at   1773_at
18     1773_at    177_at
2328 2028_s_at 202800_at
2329 202800_at 202801_at
2330 202801_at 202802_at
 >
 >
 > all.equal(sort(PSnames.PC), featureNames(rawset))
[1] TRUE
 >
 >
 > PSnames.linux <- featureNames(rawset)
 >
 > save.image("NameOrderTest.RData")
 >
 > sessionInfo()
R version 2.9.0 (2009-04-17)
x86_64-unknown-linux-gnu

locale:
LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] hgu133a2cdf_2.4.0 affy_1.22.0       Biobase_2.4.0

loaded via a namespace (and not attached):
[1] affyio_1.8.1         preprocessCore_1.6.0 tools_2.9.0
 >
 > q()


# now move back to PC:

 > library(affy)
Loading required package: Biobase

Welcome to Bioconductor

   Vignettes contain introductory material. To view, type
   'openVignette()'. To cite Bioconductor, see
   'citation("Biobase")' and for packages 'citation(pkgname)'.

 > load("NameOrderTest.RData")
 >
 > all.equal(PSnames.PC, featureNames(rawset))
[1] TRUE
 >
 > all.equal(PSnames.linux, featureNames(rawset))
[1] "117 string mismatches"
 >
 > all.equal(sort(PSnames.linux), featureNames(rawset))
[1] TRUE
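
The all.equal() calls above show that the two name vectors hold the same
IDs in a different order. A small helper can make that distinction
explicit when comparing saved objects. This is a sketch only;
compare_ids() is a hypothetical function, not part of affy or Biobase.

```r
# Hypothetical helper: classify how two ID vectors differ.
compare_ids <- function(a, b) {
  if (identical(a, b)) {
    "identical"
  } else if (length(a) == length(b) &&
             !anyDuplicated(a) && !anyDuplicated(b) &&
             setequal(a, b)) {
    # Same unique IDs on both sides, so only the ordering differs.
    "same IDs, different order"
  } else {
    "different IDs"
  }
}

compare_ids(c("177_at", "1773_at"), c("1773_at", "177_at"))
```

Unlike comparing sorted copies, this check does not itself depend on the
collation rules of the machine it runs on.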











Jenny Drnevich, Ph.D.

Functional Genomics Bioinformatics Specialist
W.M. Keck Center for Comparative and Functional Genomics
Roy J. Carver Biotechnology Center
University of Illinois, Urbana-Champaign

330 ERML
1201 W. Gregory Dr.
Urbana, IL 61801
USA

ph: 217-244-7355
fax: 217-265-5066
e-mail: drnevich at illinois.edu


