[BioC] Pd info package affy 10K array

Thu Jun 26 20:26:12 CEST 2008

Interesting.

To reproduce the problem Michael was having, I simply went to Affy's product 
support page and downloaded the library file, annotation file, and 
sequence file. So it appears Affy has the files mixed up on that page, and 
there isn't anything obvious about the sequence file that would tell 
anybody it is the wrong one:

 > dir(pattern = "^Mapping")
[1] "Mapping10K_probe_tab"             "Mapping10K_Xba142.CDF"
[3] "Mapping10K_Xba142.na25.annot.csv"
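
One check that would have exposed the mismatch is to compare the largest
(x, y) coordinate in the sequence file against the chip dimensions reported
by the CDF. Below is a minimal sketch in R that uses the numbers quoted
further down in this thread instead of reading the actual files. The helper
name coordsFitChip is made up for illustration, and it assumes zero-based
coordinates, which is what the index formula in the quoted message implies.

```r
# Hypothetical helper (not part of any package): TRUE if every zero-based
# (x, y) coordinate lies on a chip with the given dimensions.
coordsFitChip <- function(x, y, ncols, nrows) {
  all(x >= 0 & x < ncols & y >= 0 & y < nrows)
}

# Apparent maximum coordinate in the probe sequence file, as reported in
# the quoted message below.
maxX <- 710
maxY <- 707

# Chip dimensions reported for the two 10K CDFs in Henrik's message.
fitsXba142 <- coordsFitChip(maxX, maxY, ncols = 658, nrows = 658)
fitsXba131 <- coordsFitChip(maxX, maxY, ncols = 712, nrows = 712)

print(c(Xba142 = fitsXba142, Xba131 = fitsXba131))
```

The coordinates fall outside a 658x658 chip but inside a 712x712 one, which
is consistent with the sequence file belonging to Mapping10K_Xba131.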

Best,

Jim

Henrik Bengtsson wrote:
> Note that there are two different Affymetrix 10K chip types, namely
> Mapping10K_Xba131 (aka 'Mapping 10K Array') and Mapping10K_Xba142 (aka
> 'Mapping 10K Array 2.0').  The probe sequence file you refer to seems
> to be for the former, which is a larger chip.  Details on the official
> Affymetrix CDFs (converted to binary though):
> 
>> library(aroma.affymetrix)
>> cdf <- AffymetrixCdfFile$byChipType("Mapping10K_Xba142")
>> cdf
> AffymetrixCdfFile:
> Path: annotationData/chipTypes/Mapping10K_Xba142
> Filename: Mapping10K_Xba142.cdf
> Filesize: 9.53MB
> Chip type: Mapping10K_Xba142
> RAM: 0.00MB
> File format: v4 (binary; XDA)
> Dimension: 658x658
> Number of cells: 432964
> Number of units: 10208
> Cells per unit: 42.41
> Number of QC units: 9
> 
>> cdf <- AffymetrixCdfFile$byChipType("Mapping10K_Xba131")
>> cdf
> AffymetrixCdfFile:
> Path: annotationData/chipTypes/Mapping10K_Xba131
> Filename: Mapping10K_Xba131.cdf
> Filesize: 10.79MB
> Chip type: Mapping10K_Xba131
> RAM: 0.00MB
> File format: v4 (binary; XDA)
> Dimension: 712x712
> Number of cells: 506944
> Number of units: 11564
> Cells per unit: 43.84
> Number of QC units: 9
> 
> FYI: I try to collect information about various Affymetrix chip types at:
> 
>   http://groups.google.com/group/aroma-affymetrix/web/documentation-on-chip-types
> 
> Final comment: I would like to emphasize the difference between 'chip
> type' and 'CDF'; a chip type refers to a unique product coming out of
> Affymetrix, whereas a CDF refers to an annotation of a chip type.
> There can be many different CDFs for each chip type, but only one chip
> type per CDF.
> 
> Cheers
> 
> Henrik
> 
> On Thu, Jun 26, 2008 at 9:42 AM, James W. MacDonald
> <jmacdon at med.umich.edu> wrote:
>> Hi Michael,
>>
>> Michael Gormley wrote:
>>> I get an error when running the makePdInfoPackage function to make a PdInfo
>>> package for the 10K mapping array.  The output from the function reads:
>>>
>>>> makePdInfoPackage(pkg,destDir=".")
>>> Creating package in ./pd.mapping10k.xba142
>>> loadUnitsByBatch took 22.86 sec
>>> loadAffyCsv took 2.79 sec
>>> Error in sqliteExecStatement(con, statement, bind.data) :
>>>  RS-DBI driver: (RS_SQLite_exec: could not execute: PRIMARY KEY must be unique)
>>> In addition: Warning messages:
>>> 1: In is.na(v) : is.na() applied to non-(list or vector) of type 'NULL'
>>> 2: In is.na(v) : is.na() applied to non-(list or vector) of type 'NULL'
>>> 3: In is.na(v) : is.na() applied to non-(list or vector) of type 'NULL'
>>> Timing stopped at: 0.36 0.01 0.44
>> I have spent some time looking at this, and it appears that the problem is
>> due to inconsistencies between the cdf and probe sequence files. As far as I
>> can tell there are many probe locations ((x, y) coordinates) in the cdf that
>> don't exist in the probe sequence file, and vice versa.
>>
>> The function loadAffySeqCsv() reads in a chunk of data from the probe
>> sequence file, then matches the indices (computed from the (x, y)
>> coordinates) of these data with the indices that were generated using the
>> cdf data. In the first chunk of 1000 probesets, only 8223 probes match
>> between the two data sources. I don't think a partial match would
>> normally matter, but here the 1000 probesets from the sequence file
>> should *exactly* line up with what we got from the cdf.
>>
>> But the real problem that arises is this:
>>
>> The computation of indices is based on the dimensions of the chip. If we
>> query the cdf to find what the dimensions are we get this:
>>
>> readCdfHeader(cdfFile)
>> $ncols
>> [1] 658
>>
>> $nrows
>> [1] 658
>>
>> So we compute the indices thus:
>>
>> index <- x + 1 + y * ncols
>>
>> This will give unique indices for all (x, y) coordinates on the chip,
>> assuming we agree that the dimensions of the chip are 658 x 658. However,
>> the sequence file doesn't agree:
>>
>> pmdf[pmdf$fid == 9264,]
>>         fset.name   x  y offset                       seq tstrand type tallele
>> 7077 SNP_A-1507675 709 13      0 TGCCCTGAATGTTTCAGCACATCTA       r   PM       T
>>      fid
>> 7077 9264
>>
>> The above is one line from the first 1000 probesets. Note that the (x, y)
>> coordinates are (709, 13)! When we calculate the index (fid) we get 9264.
>> Unfortunately, if we use (51, 14) we also get 9264. Because the sequence
>> file isn't playing by the rules, we end up with a total of 25 duplicate
>> indices. Since the index values are the primary key for the table we are
>> trying to populate we get an error because you can't have duplicated primary
>> keys.
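
The collision described above can be reproduced with plain arithmetic. The
following is only a sketch: cellIndex is a made-up name wrapping the formula
from the quoted message, and 712 is the dimension Henrik reports for the
larger Mapping10K_Xba131 chip.

```r
# Cell index as computed in the quoted message: zero-based (x, y)
# coordinates mapped to a one-based index.
cellIndex <- function(x, y, ncols) {
  x + 1 + y * ncols
}

ncols <- 658  # from readCdfHeader() on the Xba142 CDF, as quoted above

onChip  <- cellIndex(51, 14, ncols)   # valid coordinate: x < ncols
offChip <- cellIndex(709, 13, ncols)  # x >= ncols, so it spills into row 14

print(c(onChip = onChip, offChip = offChip))

# With 712 columns the same two coordinates get distinct indices.
print(cellIndex(51, 14, 712) != cellIndex(709, 13, 712))
```

Since 709 = 658 + 51, an x value of 709 in row 13 lands on the same index as
x = 51 in row 14 whenever the row width is taken to be 658.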
>>
>> So long story short, the sequence file for this chip is broken - the
>> apparent maximum (x, y) coordinate is (710, 707) which is well beyond what
>> the cdf claims. Or maybe the cdf is broken - I don't really know. The end
>> result is that this will never work until Affy comes up with some consistent
>> information for the chip.
>>
>> Best,
>>
>> Jim
>>
>>
>>
>>
>>>> traceback()
>>> 12: .Call("RS_SQLite_exec", conId, statement, bind.data, PACKAGE = .SQLitePkgName)
>>> 11: sqliteExecStatement(con, statement, bind.data)
>>> 10: sqliteQuickSQL(conn, statement, bind.data, ...)
>>> 9: dbGetPreparedQuery(db, sql, bind.data = mmdf)
>>> 8: dbGetPreparedQuery(db, sql, bind.data = mmdf)
>>> 7: loadAffySeqCsv(db, csvSeqFile, cdfFile, batch_size = batch_size)
>>> 6: eval(expr, envir, enclos)
>>> 5: eval(expr, envir = loc.frame)
>>> 4: ST(loadAffySeqCsv(db, csvSeqFile, cdfFile, batch_size = batch_size))
>>> 3: buildPdInfoDb(object@cdfFile, object@csvAnnoFile, object@csvSeqFile,
>>>       dbFilePath, seqMatFile, batch_size = batch_size, verbose = !quiet)
>>> 2: makePdInfoPackage(pkg, destDir = ".")
>>> 1: makePdInfoPackage(pkg, destDir = ".")
>>>
>>> I noticed a prior post that suggested that this may be due to entering a
>>> record into a table with a Feature ID that is already in the table.  Is this
>>> the case?  Is there a work-around here?
>>>
>>> Thanks,
>>> Mike Gormley
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>> --
>> James W. MacDonald, M.S.
>> Biostatistician
>> Affymetrix and cDNA Microarray Core
>> University of Michigan Cancer Center
>> 1500 E. Medical Center Drive
>> 7410 CCGC
>> Ann Arbor MI 48109
>> 734-647-5623
>>