[Bioc-devel] Dimension (rows & columns) in the writeCDF function seems to be swapped

Fri May 21 02:25:42 CEST 2010

CC to Ben Bolstad who may want to confirm a statement below and the
maintainer of affy, since the bug is there

Executive summary: there is a bug in read.affybatch having to do with
non-square microarrays (ncol and nrow were switched).  This comes from
a misunderstanding of the output of affyio::read.celfile.header which
lists dimensions as c(ncol, nrow), not as c(nrow, ncol).  I have added
names to the output of affyio::read.celfile.header and I have changed
affy::read.affybatch to use these names instead of the order of the
output.  I have committed the changes to the devel branch, but I think
release needs to be updated as well.  Version numbers have not been
bumped.  I will wait for reactions to this email first.

Long summary: To the people coming in a bit late: This is a
complicated issue Philip has reported with a pipeline taking a
platform design object, converting it into a binary CDF file using
affxparser as well as a script from the aroma.affymetrix website,
converting the cdf file into a cdf environment using makecdfenv and
finally calling rma from affy.  All of this on a non-square cel file.

Data is in the zip file
  https://sendit.wur.nl/Download.aspx?id=7becdf12-35ff-44c3-a4a7-537d0fe895a2
The main script from Philip is CreateCel.R

Note that the CDF file included above has "switched" dimensions, ie.
Philip has modified a script to swap dimensions.

Essentially, it all boils down to the following: read.celfile.header yields
> read.celfile.header("MouseBrain_1.CEL")
$`CEL dimensions`
[1]  990 1190

Contrary to what one immediately would assume, this is actually
nColumns, nRows as stated by Ben Bolstad in another thread (and I
would agree based on a casual reading of the source code from affyio).
 I also get (using affxparser, output scrubbed)
> readCelHeader("MouseBrain_1.CEL")
$cols
[1] 990

$rows
[1] 1190

However, the function read.affybatch in affy assumes that the
dimensions are nRows x nColumns, as seen in the source code (line
numbers from read.affybatch.R from subversion):

     71   headdetails <- read.celfile.header(as.character(filenames[[1]]))
     72   ##now we use the length
     73   dim.intensity <- headdetails[[2]]   ##dim(intensity(cel))

<SNIP>

   108   if (!sd){
    109     return(new("AffyBatch",
    110                exprs  = exprs,
    111                ##se.exprs = array(NaN, dim=dim.sd),
    112                cdfName    = cdfname,   ##cel at cdfName,
    113                phenoData  = phenoData,
    114                nrow       = dim.intensity[1],
    115                ncol       = dim.intensity[2],
    116                annotation = cleancdfname(cdfname, addcdf=FALSE),
    117                protocolData  = protocol,
    118                description= description,
    119                notes      = notes))

Clearly the current author assumed the dimensions would be nrow x ncol
and not reversed.

Finally, a FYI to Ben: currently, the show method for AffyBatch does
_not_ report nCol x nRow as stated in another thread, as per

> getMethod("show", "AffyBatch")
<SNIP>
    cat("size of arrays=", nrow(object), "x", ncol(object), " features (",
        object.size(object)%/%1024, " kb)\n", sep = "")
<SNIP>

A short time bug fix would be to change line 73 in read.affybatch.R from
  dim.intensity <- headdetails[[2]]
to
  dim.intensity <- rev(headdetails[[2]])

A better long term fix would be to put names on the output of
affyio::read.celfile.header so that the order is documented, and also
to

Kasper

On Thu, May 20, 2010 at 8:30 AM, Kasper Daniel Hansen
<khansen at stat.berkeley.edu> wrote:
> Your zip file does not seem to be working.
>
> Kasper
>
> On Thu, May 20, 2010 at 6:12 AM, Groot, Philip de <philip.degroot at wur.nl> wrote:
>> Hello Karst,
>>
>> You can download all required files via this link:
>> https://sendit.wur.nl/Download.aspx?id=9eed2afd-40a8-4770-b44b-5c0b6e55084c
>>
>> Simply execute the commands in the Create_CDF.R script and everything you need will be created and installed.
>>
>> Note that also a fixed "PdInfo2Cdf.R" file is included. To create a CDF with the rows & columns NOT swapped (so that gives the error during RMA) please edit the script at the commented-out lines. Currently, the script creates a proper CDF that also does the RMA normalization properly.
>>
>> Thank you for your help!
>>
>> Regards,
>>
>> Dr. Philip de Groot Ph.D.
>> Bioinformatics Researcher
>>
>> Wageningen University / TIFN
>> Nutrigenomics Consortium
>> Nutrition, Metabolism & Genomics Group
>> Division of Human Nutrition
>> PO Box 8129, 6700 EV Wageningen
>> Visiting Address: Erfelijkheidsleer: De Valk, Building 304
>> Dreijenweg 2, 6703 HA  Wageningen
>> Room: 0052a
>> T: +31-317-485786
>> F: +31-317-483342
>> E-mail:   Philip.deGroot at wur.nl
>> Internet: http://www.nutrigenomicsconsortium.nl
>>             http://humannutrition.wur.nl/
>>             https://madmax.bioinformatics.nl/
>> ________________________________________
>> From: kasperdanielhansen at gmail.com [kasperdanielhansen at gmail.com] On Behalf Of Kasper Daniel Hansen [khansen at stat.berkeley.edu]
>> Sent: 19 May 2010 18:34
>> To: Groot, Philip de
>> Cc: bioc-devel at stat.math.ethz.ch; Hooiveld, Guido
>> Subject: Re: [Bioc-devel] Dimension (rows & columns) in the writeCDF function   seems to be swapped
>>
>> Hi Philip
>>
>> I am trying to understand what you report here.
>>
>> If I use the "wrong" cdf file and I read it using either readCdfHeader
>> in affxparser or read.cdffile.list (full output below) in affyio I get
>>  Cols = 990
>>  Rows = 1190
>> and I have looked at the binary file using hexdump and this is also
>> what is in the file according to Affymetrix file format
>> specifications.
>>
>> This means that I can understand your message as either
>> 1) the file pdmogene11stv1_wrong.cdf does not reflect the dimensions
>> in the PDinfo package.
>> 2) Using this cdf file as input to makecdfenv yields wrong results.
>>
>> Now, when I install the pd.mogene.1.1.st.v1 package I get
>>> pd.mogene.1.1.st.v1
>> Class........: AffyGenePDInfo
>> Manufacturer.: Affymetrix
>> Genome Build.: MM09
>> Chip Geometry: 1190 rows x  990 columns
>>
>> which looks correct.  Not that the pdinfo2cdf script posted on the
>> aroma.affymetrix website has an error (the lines
>>  celHead <- readCelHeader(celfile)
>>  nrows <- celHead$rows
>>  ncols <- celHead$rows
>> ), but you say in your email you have modifed the script (which never
>> made it through), and given that the CDF file corresponds to the
>> pdinfo object you have probably fixed this).
>>
>> That leaves 2).  A casual inspection of the tarball
>>  mogene11stv1cdf_2.1.0_notSwitched.tar.gz
>> makes it look as if the dimensions here are correct as well.  However,
>> I suspect I am missing something.
>>
>> In order to debug this further (and there could very well be an error
>> somewhere, where cols and rows are switched) we will need your scripts
>> for generating the normalized intensities and we probably also need
>> the cell files.  Please modify your scripts to use the exact same
>> filenames as the packages you have posted on sendit.
>>
>> And not that you cannot attach anything (or at least, your attachments
>> does not make it through to me).
>>
>> Kasper
>>
>>
>>> tmp = read.cdffile.list("pdmogene11stv1_wrong.cdf")
>>> tmp$Header
>> $Dimensions
>>  MagicNumber VersionNumber          Cols          Rows     n.QCunits
>>           67             1           990          1190             0
>>      n.units     LenRefSeq
>>        35556             0
>>
>> $ReseqRefSeq
>> [1] ""
>>
>>> readCdfHeader("pdmogene11stv1_wrong.cdf")
>> $ncols
>> [1] 990
>>
>> $nrows
>> [1] 1190
>>
>> $nunits
>> [1] 35556
>>
>> $nqcunits
>> [1] 0
>>
>> $refseq
>> [1] ""
>>
>> $chiptype
>> [1] "pdmogene11stv1_wrong"
>>
>> $filename
>> [1] "./pdmogene11stv1_wrong.cdf"
>>
>> $rows
>> [1] 1190
>>
>> $cols
>> [1] 990
>>
>> $probesets
>> [1] 35556
>>
>> $qcprobesets
>> [1] 0
>>
>> $reference
>> [1] ""
>>
>>>
>>
>>
>> On Wed, May 19, 2010 at 4:00 AM, Groot, Philip de <philip.degroot at wur.nl> wrote:
>>> Dear Kasper,
>>>
>>> I am contacting you because I am puzzled with an issue that is related to affxparser. I submitted this message to the bioc-devel list on april 29 (subject: "RE: [Bioc-devel] FW: problem in ReadAffy function (affy and affyio libraries)") but never received any reply. Interestingly, in the file writeCDF.R (affxparser source) the following comment is present:
>>>
>>> # 2006-09-11 /HB
>>> # o BUG FIX: nrows & ncols were swapped in the CDF header.
>>> The problem:
>>> I succesfully created a "mogene11stv1cdf" library that can be used with the "affy" library. Benefit of this is that e.g. the affyPLM library can be used for creating informative plots. I applied rma utilizing affy and oligo and found exactly the same normalized intensities (see attached png-image).
>>>
>>> So far so good, BUT... in order to get rma working for affy, I needed to switch the number or rows and columns when creating the CDF-file. And this puzzles me a lot! I do not understand where this originates from.
>>>
>>> Let me explain. I use the oligo library and the "pd.mogene.1.1.st.v1" annotation file to create the CDF-file. I (together with Guido Hooiveld) adapted the original "PdInfo2Cdf.R" script for this purpose. Information on the original "PdInfo2Cdf.R" script is here: http://www.aroma-project.org/node/40. The adapted file is also in the attachment.
>>>
>>> The CDF-file is created as follows:
>>> source("PdInfo2Cdf.R")
>>> PdInfo2Cdf("pd.mogene.1.1.st.v1", <An appropriate .CEL-file>);
>>>
>>> library(makecdfenv)
>>> make.cdf.package(file="pdmogene11stv1.cdf", packagename = "mogene11stv1cdf", author="Philip de Groot", maintainer="Philip de Groot <Philip.deGroot at wur.nl>", version="2.1.0", species="Mus_musculus")
>>>
>>> Both CDF-files are available in a single zip-file via sendit:
>>> https://sendit.wur.nl/Download.aspx?id=9299e472-1d5a-4f83-af75-c10aa1faf94c
>>>
>>> Does somebody has an explanation for this? Is it a bug in writeCDF?
>>>
>>> Regards,
>>>
>>> Dr. Philip de Groot Ph.D.
>>> Bioinformatics Researcher
>>>
>>> Wageningen University / TIFN
>>> Nutrigenomics Consortium
>>> Nutrition, Metabolism & Genomics Group
>>> Division of Human Nutrition
>>> PO Box 8129, 6700 EV Wageningen
>>> Visiting Address: Erfelijkheidsleer: De Valk, Building 304
>>> Dreijenweg 2, 6703 HA  Wageningen
>>> Room: 0052a
>>> T: +31-317-485786
>>> F: +31-317-483342
>>> E-mail:   Philip.deGroot at wur.nl<mailto:Philip.deGroot at wur.nl>
>>> Internet: http://www.nutrigenomicsconsortium.nl<http://www.nutrigenomicsconsortium.nl/>
>>>             http://humannutrition.wur.nl/
>>>             https://madmax.bioinformatics.nl/
>>>
>>> _______________________________________________
>>> Bioc-devel at stat.math.ethz.ch mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>>>
>> _______________________________________________
>> Bioc-devel at stat.math.ethz.ch mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>