[BioC] Unable to load in Affymetrix data CEL files with readCelHeader
Henrik Bengtsson
hb at stat.berkeley.edu
Tue May 13 23:53:52 CEST 2008
Hi,
On Tue, May 13, 2008 at 12:31 PM, Robert Gentleman <rgentlem at fhcrc.org> wrote:
> Hi Matt,
>
>
> Matt K wrote:
>
> > I am having problems reading in some publicly available chromosome X
> > titration Nsp chip CEL files. The data are available from the Affymetrix
> > website:
> >
> >
> http://www.affymetrix.com/support/technical/sample_data/copy_number_data.affx
> >
>
> I did not see any file there that obviously contained the CEL files, can
> you say what you downloaded (provided the questions below don't solve your
> problem).
FYI/for the record, they are in the *.DTT files (basically a *.zip
archive split in many files). It is quite tedious to extract the
files if you don't have the right tools, but it is still possible with
a basic Unix setup. See
http://groups.google.com/group/aroma-affymetrix/web/mapping250k-nsp-mapping250k-sty
and the reference to Page 'Affymetrix multi-part DTT/ZIP archives' for
the details.
>
>
>
> >
> > I have not modified the data in any way. Here is what happens when I try to
> > read the data:
> >
> >
> > > library(affxparser)
> > > path <- "./rawData/3X/Mapping250K_Nsp/"
> > > pathnames <- list.files(path=path, pattern="[.](cel|CEL)$",
> > >                         full.names=TRUE)
> >
> > > pathnames
> > >
> > [1] "./rawData/3X/Mapping250K_Nsp//NA04626_NSP_R1.CEL"
> > [2] "./rawData/3X/Mapping250K_Nsp//NA04626_NSP_R2.CEL"
> > [3] "./rawData/3X/Mapping250K_Nsp//NA04626_NSP_R3.CEL"
> > [4] "./rawData/3X/Mapping250K_Nsp//NA04626_NSP_R4.CEL"
> >
> > > hdr <- readCelHeader(pathnames[1])
> > >
> > terminate called after throwing an instance of
> > 'affymetrix_calvin_exceptions::UnableToOpenFileException'
That is sometimes how 'affxparser' responds to corrupt files. There is
currently no exception handling at the native-code level in affxparser,
which causes it to core dump on bad files. It's on the to-do list, but
with low priority.
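Because the abort happens in native code, tryCatch() in R cannot intercept it. One possible workaround is to check in plain R that a file exists, is readable, and is non-empty before handing it to readCelHeader(). The following is only a minimal sketch using base R; canReadFile() is a hypothetical helper name, not part of affxparser, and it does not detect corrupt CEL content, only files that cannot be opened or read at all.

```r
# Hypothetical helper (not part of affxparser): returns TRUE if 'pathname'
# is a non-empty regular file that can be opened and read from R.
canReadFile <- function(pathname, nbytes = 16L) {
  if (!file.exists(pathname)) return(FALSE)
  # mode = 4 tests for read permission; file.access() returns 0 on success
  if (file.access(pathname, mode = 4) != 0) return(FALSE)
  info <- file.info(pathname)
  if (is.na(info$size) || info$isdir || info$size == 0) return(FALSE)
  # Open in binary mode; treat any warning or error as "cannot open"
  con <- tryCatch(file(pathname, open = "rb"),
                  error = function(e) NULL,
                  warning = function(w) NULL)
  if (is.null(con)) return(FALSE)
  on.exit(close(con))
  # Try to read the first few bytes
  bytes <- readBin(con, what = "raw", n = nbytes)
  length(bytes) > 0
}

# Possible usage, assuming 'pathnames' as defined in the original post:
# ok <- vapply(pathnames, canReadFile, logical(1))
# hdrs <- lapply(pathnames[ok], readCelHeader)
```

A file that passes this check can still be truncated or otherwise corrupt, so comparing sizes and checksums remains worthwhile.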
FYI, I've got those files as well and I can read them perfectly well
using affxparser v1.11.3:
library(affxparser);
path <- "rawData/Affymetrix_2006-Chromosome_X/Mapping250K_Nsp/";
pathnames <- list.files(pattern="[.]CEL$", path=path, full.names=TRUE);
hdrs <- lapply(pathnames, readCelHeader);
Since it is quite tricky to extract the CEL files from the DTT
files, you might have got something wrong there. You might also have
downloaded the DTT, D02, D03, ... files in text mode and not binary
mode (adding/removing extra bytes). My notes on
http://groups.google.com/group/aroma-affymetrix/web/affymetrix-multi-part-dtt-zip-archives
might shed some light on your problem.
For your troubleshooting, here are some details (make sure to have
the latest version of 'digest' installed):
library(digest);
x <- lapply(pathnames, FUN=function(pathname) {
c(basename(pathname), file.info(pathname)$size, digest(file=pathname))
});
x <- as.data.frame(matrix(unlist(x), ncol=3, byrow=TRUE));
colnames(x) <- c("filename", "bytes", "md5");
print(x);
             filename    bytes                              md5
1  NA01416_NSP_R1.CEL 65743954 edce95d22481a133bcadb4faa79eb8d5
2  NA01416_NSP_R2.CEL 65701910 89291d7fb32b43ce6a9c83716d3db747
3  NA01416_NSP_R3.CEL 65727988 fba34620e18b6b3de8ff3a394ed0e313
4  NA01416_NSP_R4.CEL 65730078 0b26c2ac467fda36182855eca1e005e5
5  NA04626_NSP_R1.CEL 65693986 e720285a271506ea79bcc067feb90066
6  NA04626_NSP_R2.CEL 65725339 e7934222ecd4b0f9f46bcea27aa53549
7  NA04626_NSP_R3.CEL 65703014 7300b841c8b7eafdd73f6272de7e551d
8  NA04626_NSP_R4.CEL 65712493 64a9c11f8bc268ea17632f23264f0be1
9  NA06061_NSP_R1.CEL 65741243 51cae8fcdf3761ef4ce5d68ef980dfce
10 NA06061_NSP_R2.CEL 65703467 01a19363f226765b5c3b10cd5607dc98
11 NA06061_NSP_R3.CEL 65721991 9fdc58f8036457caa3551bc9eb8cd046
12 NA06061_NSP_R4.CEL 65712766 e7fa2d9adf599fd29b80c121b7e4dfe7
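To compare local files against the listing above, something along these lines might work. It is a minimal sketch that uses tools::md5sum() from the standard R distribution instead of 'digest'; both compute the MD5 checksum of the file contents. The name compareChecksums() is a hypothetical helper, and 'expected' holds only two of the checksums above for brevity.

```r
# Reference MD5 checksums copied from the listing above (two shown for brevity)
expected <- c(
  "NA04626_NSP_R1.CEL" = "e720285a271506ea79bcc067feb90066",
  "NA04626_NSP_R2.CEL" = "e7934222ecd4b0f9f46bcea27aa53549"
)

# Hypothetical helper: one row per file, with 'matches' TRUE only when the
# file could be read and its MD5 equals the reference value for that name.
compareChecksums <- function(pathnames, expected) {
  actual <- unname(tools::md5sum(pathnames))   # NA for unreadable files
  ref <- unname(expected[basename(pathnames)]) # NA if no reference is listed
  data.frame(filename = basename(pathnames),
             md5 = actual,
             matches = !is.na(actual) & !is.na(ref) & actual == ref,
             stringsAsFactors = FALSE)
}

# Possible usage, assuming 'pathnames' as defined earlier:
# print(compareChecksums(pathnames, expected))
```

A row with matches = FALSE points at a file that differs from the copy described here, for example because of a text-mode download or an incomplete extraction.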
Hope this helps
Henrik
> >
>
> That suggests that you may not have read permission on them. Did you check
> and see if you could open those files with any other tool/editor?
>
> You could just try and open them and read them using standard R tools to
> see if that works (if they are binary CEL files then you will just get junk
> from readLines, but that isn't the issue, you just want to know if they can
> be opened from R).
>
> Your R is out of date, and this sessionInfo is not correct, as you should
> have had affxparser attached and it is not. Please don't do that, mixing
> and matching error messages and sessionInfo output makes life hard for
> anyone that wants to help. Step one is to update R and BioC...
>
>
>
>
> >
> > Process R aborted at Tue May 13 15:01:55 2008
> >
> > As you see R aborts. The same failure happens when I try to load in any of
> > the other CEL files. My R session info is:
> >
> >
> > > sessionInfo()
> > >
> > R version 2.7.0 Under development (unstable) (2008-01-21 r44087)
> > x86_64-unknown-linux-gnu
> >
> > locale:
> >
> LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
> >
> > attached base packages:
> > [1] stats graphics grDevices utils datasets methods base
> >
> > Thanks for any help.
> >
> > Matt
> >
> > [[alternative HTML version deleted]]
> >
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor at stat.math.ethz.ch
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
> >
> >
>
> --
> Robert Gentleman, PhD
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M2-B876
> PO Box 19024
> Seattle, Washington 98109-1024
> 206-667-7700
> rgentlem at fhcrc.org