[Bioc-devel] Dimension (rows & columns) in the writeCDF function seems to be swapped

Fri May 21 10:21:27 CEST 2010

Hello all,

I am glad that the issue is resolved at last (Thanks Kasper!). Consequently, I (or the core project members) can provide CDF-files (derived from the pdInfoBuilder packages) for the hugene, mogene, ragene ST v1.0 and 1.1 arrays so that these arrays can also be analysed utilizing the "affy" library.

The question is: who takes responsibility for this? Currently, I maintain the NuGO (custom Affymetrix) arrays and including the CDF-files (derived from the pdInfoBuilder files) for these arrays  is no big deal.

However, the issue is that for the Gene ST 1.0 arrays non-supported CDF-files have been released by Affymetrix which are reflected in the currently available CDF-environments in the Bioconductor repository. Unfortunately, the different CDF-environments for the same chip do not match, for example:

Freshly installed mogene10stv1cdf library from Bioconductor:
> length(ls(mogene10stv1cdf))
[1] 34760

And the CDF-library as compiled by me (derived from pdInfoBuilder):
> length(ls(mogene10stv1cdf))
[1] 35556

Personally, I prefer the CDF-libraries derived from the pdInfoBuilder libraries, because these exactly reflects the chips as Affymetrix intended it (utilizing the .bgp, .clf, and so on files from Affymetrix: the official support files).

So I think we need to discuss on this issue too... What CDF files will be made available, who takes responsibility, and who else needs to be included?

Regards,

Dr. Philip de Groot Ph.D.
Bioinformatics Researcher

Wageningen University / TIFN
Nutrigenomics Consortium
Nutrition, Metabolism & Genomics Group
Division of Human Nutrition
PO Box 8129, 6700 EV Wageningen
Visiting Address: Erfelijkheidsleer: De Valk, Building 304
Dreijenweg 2, 6703 HA  Wageningen
Room: 0052a
T: +31-317-485786
F: +31-317-483342
E-mail:   Philip.deGroot at wur.nl
Internet: http://www.nutrigenomicsconsortium.nl
             http://humannutrition.wur.nl/
             https://madmax.bioinformatics.nl/
________________________________________
From: kasperdanielhansen at gmail.com [kasperdanielhansen at gmail.com] On Behalf Of Kasper Daniel Hansen [khansen at stat.berkeley.edu]
Sent: 21 May 2010 02:25
To: Groot, Philip de
Cc: bioc-devel at stat.math.ethz.ch; Hooiveld, Guido; bmb at bmbolstad.com; rafa at jhu.edu; Henrik Bengtsson
Subject: Re: [Bioc-devel] Dimension (rows & columns) in the writeCDF function   seems to be swapped

CC to Ben Bolstad who may want to confirm a statement below and the
maintainer of affy, since the bug is there

Executive summary: there is a bug in read.affybatch having to do with
non-square microarrays (ncol and nrow were switched).  This comes from
a misunderstanding of the output of affyio::read.celfile.header which
lists dimensions as c(ncol, nrow), not as c(nrow, ncol).  I have added
names to the output of affyio::read.celfile.header and I have changed
affy::read.affybatch to use these names instead of the order of the
output.  I have committed the changes to the devel branch, but I think
release needs to be updated as well.  Version numbers have not been
bumped.  I will wait for reactions to this email first.

Long summary: To the people coming in a bit late: This is a
complicated issue Philip has reported with a pipeline taking a
platform design object, converting it into a binary CDF file using
affxparser as well as a script from the aroma.affymetrix website,
converting the cdf file into a cdf environment using makecdfenv and
finally calling rma from affy.  All of this on a non-square cel file.

Data is in the zip file
  https://sendit.wur.nl/Download.aspx?id=7becdf12-35ff-44c3-a4a7-537d0fe895a2
The main script from Philip is CreateCel.R

Note that the CDF file included above has "switched" dimensions, ie.
Philip has modified a script to swap dimensions.

Essentially, it all boils down to the following: read.celfile.header yields
> read.celfile.header("MouseBrain_1.CEL")
$`CEL dimensions`
[1]  990 1190

Contrary to what one immediately would assume, this is actually
nColumns, nRows as stated by Ben Bolstad in another thread (and I
would agree based on a casual reading of the source code from affyio).
 I also get (using affxparser, output scrubbed)
> readCelHeader("MouseBrain_1.CEL")
$cols
[1] 990

$rows
[1] 1190

However, the function read.affybatch in affy assumes that the
dimensions are nRows x nColumns, as seen in the source code (line
numbers from read.affybatch.R from subversion):

     71   headdetails <- read.celfile.header(as.character(filenames[[1]]))
     72   ##now we use the length
     73   dim.intensity <- headdetails[[2]]   ##dim(intensity(cel))

<SNIP>

   108   if (!sd){
    109     return(new("AffyBatch",
    110                exprs  = exprs,
    111                ##se.exprs = array(NaN, dim=dim.sd),
    112                cdfName    = cdfname,   ##cel at cdfName,
    113                phenoData  = phenoData,
    114                nrow       = dim.intensity[1],
    115                ncol       = dim.intensity[2],
    116                annotation = cleancdfname(cdfname, addcdf=FALSE),
    117                protocolData  = protocol,
    118                description= description,
    119                notes      = notes))

Clearly the current author assumed the dimensions would be nrow x ncol
and not reversed.

Finally, a FYI to Ben: currently, the show method for AffyBatch does
_not_ report nCol x nRow as stated in another thread, as per

> getMethod("show", "AffyBatch")
<SNIP>
    cat("size of arrays=", nrow(object), "x", ncol(object), " features (",
        object.size(object)%/%1024, " kb)\n", sep = "")
<SNIP>

A short time bug fix would be to change line 73 in read.affybatch.R from
  dim.intensity <- headdetails[[2]]
to
  dim.intensity <- rev(headdetails[[2]])

A better long term fix would be to put names on the output of
affyio::read.celfile.header so that the order is documented, and also
to

Kasper

On Thu, May 20, 2010 at 8:30 AM, Kasper Daniel Hansen
<khansen at stat.berkeley.edu> wrote:
> Your zip file does not seem to be working.
>
> Kasper
>
> On Thu, May 20, 2010 at 6:12 AM, Groot, Philip de <philip.degroot at wur.nl> wrote:
>> Hello Karst,
>>
>> You can download all required files via this link:
>> https://sendit.wur.nl/Download.aspx?id=9eed2afd-40a8-4770-b44b-5c0b6e55084c
>>
>> Simply execute the commands in the Create_CDF.R script and everything you need will be created and installed.
>>
>> Note that also a fixed "PdInfo2Cdf.R" file is included. To create a CDF with the rows & columns NOT swapped (so that gives the error during RMA) please edit the script at the commented-out lines. Currently, the script creates a proper CDF that also does the RMA normalization properly.
>>
>> Thank you for your help!
>>
>> Regards,
>>
>> Dr. Philip de Groot Ph.D.
>> Bioinformatics Researcher
>>
>> Wageningen University / TIFN
>> Nutrigenomics Consortium
>> Nutrition, Metabolism & Genomics Group
>> Division of Human Nutrition
>> PO Box 8129, 6700 EV Wageningen
>> Visiting Address: Erfelijkheidsleer: De Valk, Building 304
>> Dreijenweg 2, 6703 HA  Wageningen
>> Room: 0052a
>> T: +31-317-485786
>> F: +31-317-483342
>> E-mail:   Philip.deGroot at wur.nl
>> Internet: http://www.nutrigenomicsconsortium.nl
>>             http://humannutrition.wur.nl/
>>             https://madmax.bioinformatics.nl/
>> ________________________________________
>> From: kasperdanielhansen at gmail.com [kasperdanielhansen at gmail.com] On Behalf Of Kasper Daniel Hansen [khansen at stat.berkeley.edu]
>> Sent: 19 May 2010 18:34
>> To: Groot, Philip de
>> Cc: bioc-devel at stat.math.ethz.ch; Hooiveld, Guido
>> Subject: Re: [Bioc-devel] Dimension (rows & columns) in the writeCDF function   seems to be swapped
>>
>> Hi Philip
>>
>> I am trying to understand what you report here.
>>
>> If I use the "wrong" cdf file and I read it using either readCdfHeader
>> in affxparser or read.cdffile.list (full output below) in affyio I get
>>  Cols = 990
>>  Rows = 1190
>> and I have looked at the binary file using hexdump and this is also
>> what is in the file according to Affymetrix file format
>> specifications.
>>
>> This means that I can understand your message as either
>> 1) the file pdmogene11stv1_wrong.cdf does not reflect the dimensions
>> in the PDinfo package.
>> 2) Using this cdf file as input to makecdfenv yields wrong results.
>>
>> Now, when I install the pd.mogene.1.1.st.v1 package I get
>>> pd.mogene.1.1.st.v1
>> Class........: AffyGenePDInfo
>> Manufacturer.: Affymetrix
>> Genome Build.: MM09
>> Chip Geometry: 1190 rows x  990 columns
>>
>> which looks correct.  Not that the pdinfo2cdf script posted on the
>> aroma.affymetrix website has an error (the lines
>>  celHead <- readCelHeader(celfile)
>>  nrows <- celHead$rows
>>  ncols <- celHead$rows
>> ), but you say in your email you have modifed the script (which never
>> made it through), and given that the CDF file corresponds to the
>> pdinfo object you have probably fixed this).
>>
>> That leaves 2).  A casual inspection of the tarball
>>  mogene11stv1cdf_2.1.0_notSwitched.tar.gz
>> makes it look as if the dimensions here are correct as well.  However,
>> I suspect I am missing something.
>>
>> In order to debug this further (and there could very well be an error
>> somewhere, where cols and rows are switched) we will need your scripts
>> for generating the normalized intensities and we probably also need
>> the cell files.  Please modify your scripts to use the exact same
>> filenames as the packages you have posted on sendit.
>>
>> And not that you cannot attach anything (or at least, your attachments
>> does not make it through to me).
>>
>> Kasper
>>
>>
>>> tmp = read.cdffile.list("pdmogene11stv1_wrong.cdf")
>>> tmp$Header
>> $Dimensions
>>  MagicNumber VersionNumber          Cols          Rows     n.QCunits
>>           67             1           990          1190             0
>>      n.units     LenRefSeq
>>        35556             0
>>
>> $ReseqRefSeq
>> [1] ""
>>
>>> readCdfHeader("pdmogene11stv1_wrong.cdf")
>> $ncols
>> [1] 990
>>
>> $nrows
>> [1] 1190
>>
>> $nunits
>> [1] 35556
>>
>> $nqcunits
>> [1] 0
>>
>> $refseq
>> [1] ""
>>
>> $chiptype
>> [1] "pdmogene11stv1_wrong"
>>
>> $filename
>> [1] "./pdmogene11stv1_wrong.cdf"
>>
>> $rows
>> [1] 1190
>>
>> $cols
>> [1] 990
>>
>> $probesets
>> [1] 35556
>>
>> $qcprobesets
>> [1] 0
>>
>> $reference
>> [1] ""
>>
>>>
>>
>>
>> On Wed, May 19, 2010 at 4:00 AM, Groot, Philip de <philip.degroot at wur.nl> wrote:
>>> Dear Kasper,
>>>
>>> I am contacting you because I am puzzled with an issue that is related to affxparser. I submitted this message to the bioc-devel list on april 29 (subject: "RE: [Bioc-devel] FW: problem in ReadAffy function (affy and affyio libraries)") but never received any reply. Interestingly, in the file writeCDF.R (affxparser source) the following comment is present:
>>>
>>> # 2006-09-11 /HB
>>> # o BUG FIX: nrows & ncols were swapped in the CDF header.
>>> The problem:
>>> I succesfully created a "mogene11stv1cdf" library that can be used with the "affy" library. Benefit of this is that e.g. the affyPLM library can be used for creating informative plots. I applied rma utilizing affy and oligo and found exactly the same normalized intensities (see attached png-image).
>>>
>>> So far so good, BUT... in order to get rma working for affy, I needed to switch the number or rows and columns when creating the CDF-file. And this puzzles me a lot! I do not understand where this originates from.
>>>
>>> Let me explain. I use the oligo library and the "pd.mogene.1.1.st.v1" annotation file to create the CDF-file. I (together with Guido Hooiveld) adapted the original "PdInfo2Cdf.R" script for this purpose. Information on the original "PdInfo2Cdf.R" script is here: http://www.aroma-project.org/node/40. The adapted file is also in the attachment.
>>>
>>> The CDF-file is created as follows:
>>> source("PdInfo2Cdf.R")
>>> PdInfo2Cdf("pd.mogene.1.1.st.v1", <An appropriate .CEL-file>);
>>>
>>> library(makecdfenv)
>>> make.cdf.package(file="pdmogene11stv1.cdf", packagename = "mogene11stv1cdf", author="Philip de Groot", maintainer="Philip de Groot <Philip.deGroot at wur.nl>", version="2.1.0", species="Mus_musculus")
>>>
>>> Both CDF-files are available in a single zip-file via sendit:
>>> https://sendit.wur.nl/Download.aspx?id=9299e472-1d5a-4f83-af75-c10aa1faf94c
>>>
>>> Does somebody has an explanation for this? Is it a bug in writeCDF?
>>>
>>> Regards,
>>>
>>> Dr. Philip de Groot Ph.D.
>>> Bioinformatics Researcher
>>>
>>> Wageningen University / TIFN
>>> Nutrigenomics Consortium
>>> Nutrition, Metabolism & Genomics Group
>>> Division of Human Nutrition
>>> PO Box 8129, 6700 EV Wageningen
>>> Visiting Address: Erfelijkheidsleer: De Valk, Building 304
>>> Dreijenweg 2, 6703 HA  Wageningen
>>> Room: 0052a
>>> T: +31-317-485786
>>> F: +31-317-483342
>>> E-mail:   Philip.deGroot at wur.nl<mailto:Philip.deGroot at wur.nl>
>>> Internet: http://www.nutrigenomicsconsortium.nl<http://www.nutrigenomicsconsortium.nl/>
>>>             http://humannutrition.wur.nl/
>>>             https://madmax.bioinformatics.nl/
>>>
>>> _______________________________________________
>>> Bioc-devel at stat.math.ethz.ch mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>>>
>> _______________________________________________
>> Bioc-devel at stat.math.ethz.ch mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>