[BioC] Zeros in ReadAffy was Re: farms package fails on estrogen data
Ben Bolstad
bmb at bmbolstad.com
Wed Mar 17 01:51:33 CET 2010
Yes I am aware that allocMatrix does not initialize.
However, it is expected that a data point will be read from the CEL file
for every location in this matrix.
In the case of the GCOS/XDA and Calvin/CommandConsole format CEL file
the order in which things are in the file tells you where to place the
data in the matrix. This should be such that for a non-corrupt file
every entry in the matrix would get assigned an intensity value.
In the case of text format CEL file there are actually x and y
coordinates given to each probe intensity. No specific ordering is
specified in the file documentation, so these x and y coordinates are
used to determine where the relevant intensity value should be placed in
the matrix. not much checking is being done on the validity of these
currently. Note of course that text format CEL files are starting to
become more rare.
Now it turns out that in estrogen there are 9 CEL files. All 9 are text
format.
> library(affy)
> setwd(system.file("extdata", package="estrogen"))
> list.celfiles()
[1] "bad.cel" "high10-1.cel" "high10-2.cel" "high48-1.cel"
"high48-2.cel"
[6] "low10-1.cel" "low10-2.cel" "low48-1.cel" "low48-2.cel"
> ab = ReadAffy()
> range(intensity(ab))
[1] 0 46131
> apply(intensity(ab),2,range)
bad.cel high10-1.cel high10-2.cel high48-1.cel high48-2.cel low10-1.cel
[1,] 0 32.5 49.5 420.5 425 0
[2,] 46131 18308.0 45794.0 46111.0 46108 16457
low10-2.cel low48-1.cel low48-2.cel
[1,] 0 360.5 395
[2,] 20763 46102.0 46115
> which(intensity(ab)==0)
[1] 140224 2143523 2202236 2337976 2841090
> 140224%%640
[1] 64
> 140224%/%640
[1] 219
because things are zero based want to look for entry at 63 219 in
bad.cel
Low and behold:
61 219 129.0 25.0 25
62 219 84.0 14.1 25
6219 133.3 32.4 20
64 219 4241.0 634.0 25
65 219 3215.0 565.7 25
so corrupt x/y location. What about in "low10-1.cel"
> which(intensity(ab)[,6] == 0)%%640
95523 154236 289976
163 636 56
> which(intensity(ab)[,6] == 0)%/%640
95523 154236 289976
149 240 453
And again
161 149 58.8 7.4 20
162 049 100.0 11.4 25
163 149 156.5 31.9 20
and again
634 240 63.3 7.5 20
63240 57.3 8.8 20
636 240 70.0 10.3 25
and here too
54 453 132.0 21.5 25
55 053 94.0 11.5 25
56 453 145.8 19.5 16
In theory at least some these cases should have been caught by the
parser, since there is meant to be a check that ensures that no of those
coordinates are larger the the chip dimensions (which 6219 and 63240
most certainly are). The parser should most certainly be throwing its
hands up in error at this point.
In any case I would think it bad form to be distributing corrupt CEL
files as part of a package.
Going into debugger land and looking at that first cel
gdb) p buffer
$6 = " 63\b219\t133.3\t32.4\t 20\n\000 ..........
and it turns out that it takes x location as 63, y location as 133 and
the intensity value as 32.4 (so in other words it is putting an
incorrect intensity in the wrong place). Now what the heck a "\b" (aka
^H terminal code) is doing here I can't fathom, but you live with what
you get.
Now I could enforce a check that there are five distinct entries (as
separated by tabs) on each row (which is something I avoided in the past
for speed since typically npixels and stddev are ignored) and that would
catch two of those cases. There is not much that can be done about the
catching the other two unless one were to enforce that x/y locations are
sequential. The file documentation used to implement the parser does not
specify that this has to be the case for text format files.
On Tue, 2010-03-16 at 14:55 -0700, Seth Falcon wrote:
> On 3/16/10 2:37 PM, bmb at bmbolstad.com wrote:
> > ReadAffy uses allocMatrix() to allocate the memory into which it intends
> > to place probe intensities read from the file.
>
> Not sure there is an issue here or not based on your comments, but just
> so we are all on the same page, allocMatrix allocates but does not
> initialize memory for a matrix.
>
> + seth
>
>
> Based on information in the
> > CEL file header it determines grid dimensions to determine how many probe
> > intensities are expected. Exact parsing depends on the CEL file format of
> > which there are a few, but generally speaking it is expected that there is
> > an intensity to be read for every location in the grid. Zeros can arise in
> > truncated and corrupted CEL files. But many of those types of situations
> > should be detected and an error thrown. There may be other mechanisms for
> > generation of such probe intensities which make them valid.
> >
> >
> >
> >> On 03/16/2010 08:25 AM, Javier Pérez Florido wrote:
> >>> Dear list,
> >>> I've tried to use l.farms (farms package)
> >>> http://www.bioinf.jku.at/software/farms/farms.html
> >>> for preprocessing the estrogen data set available at
> >>> http://www.bioconductor.org/packages/release/data/experiment/html/estrogen.html
> >>>
> >>>
> >>> When trying to preprocess using l.farms, the following error comes up:
> >>>
> >>> background correction: none
> >>> normalization: loess
> >>> PM/MM correction : pmonly
> >>> expression values: farms
> >>> background correcting...done.
> >>> normalizing...Error: NA/NaN/Inf in call to external function (arg 1)
> >>>
> >>> q.farms works for the same data set and l.farms works for the Dilution
> >>> data set, so, I don't know what is going on here.
> >>>
> >>> Any suggestions?
> >>
> >> Hi Javier --
> >>
> >> If I
> >>
> >> library(farms)
> >> library(estrogen)
> >> setwd(system.file("extdata", package="estrogen"))
> >> ab = ReadAffy()
> >> l.farms(ab)
> >>
> >> I see the error you report. It is because
> >>
> >>> sum(intensity(ab) == 0)
> >> [1] 5
> >>
> >> so that log transformation generates -Inf and then the error. It doesn't
> >> seem too bad to
> >>
> >> intensity(ab) = 1 + intensity(ab)
> >>
> >> [I don't know enough about the inner workings of affyio, but I'm
> >> surprised that I sometimes see
> >>
> >>> range(intensity(ReadAffy()))
> >> [1] 0 46131
> >>
> >>> range(intensity(ReadAffy))
> >> [1] 1.563578e-316 4.613100e+04
> >>
> >> suggesting either initialized memory or unanticipated 'empty' intensity
> >> grid coordinates)]
> >>
> >> Martin
> >>
> >>
> >>> Javier
> >>>
> >>> sessionInfo()
> >>> R version 2.10.1 (2009-12-14)
> >>> i386-pc-mingw32
> >>>
> >>> locale:
> >>> [1] LC_COLLATE=Spanish_Spain.1252 LC_CTYPE=Spanish_Spain.1252
> >>> [3] LC_MONETARY=Spanish_Spain.1252 LC_NUMERIC=C
> >>> [5] LC_TIME=Spanish_Spain.1252
> >>>
> >>> attached base packages:
> >>> [1] stats graphics grDevices utils datasets methods base
> >>>
> >>> other attached packages:
> >>> [1] affyPLM_1.22.0 preprocessCore_1.8.0 gcrma_2.18.1
> >>> [4] affydata_1.11.10 farms_1.4.0 MASS_7.3-4
> >>> [7] hgu95av2cdf_2.5.0 affy_1.24.2 Biobase_2.6.1
> >>>
> >>> loaded via a namespace (and not attached):
> >>> [1] affyio_1.14.0 Biostrings_2.14.12 IRanges_1.4.11
> >>> splines_2.10.1
> >>> [5] tools_2.10.1
> >>>
> >>> _______________________________________________
> >>> Bioconductor mailing list
> >>> Bioconductor at stat.math.ethz.ch
> >>> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >>> Search the archives:
> >>> http://news.gmane.org/gmane.science.biology.informatics.conductor
> >>
> >>
> >> --
> >> Martin Morgan
> >> Computational Biology / Fred Hutchinson Cancer Research Center
> >> 1100 Fairview Ave. N.
> >> PO Box 19024 Seattle, WA 98109
> >>
> >> Location: Arnold Building M1 B861
> >> Phone: (206) 667-2793
> >>
> >> _______________________________________________
> >> Bioconductor mailing list
> >> Bioconductor at stat.math.ethz.ch
> >> https://stat.ethz.ch/mailman/listinfo/bioconductor
> >> Search the archives:
> >> http://news.gmane.org/gmane.science.biology.informatics.conductor
> >>
> >
> > _______________________________________________
> > Bioconductor mailing list
> > Bioconductor at stat.math.ethz.ch
> > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
> --
> Seth Falcon
> Bioconductor Core Team | FHCRC
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
More information about the Bioconductor
mailing list