[BioC] Pd info package affy 10K array

Henrik Bengtsson hb at stat.berkeley.edu
Tue Jul 1 22:15:19 CEST 2008


Hi,

FYI, and related to this thread: I've posted a 'Request for more
consistent filenames for chip type files' to the "General" forum of
the Affymetrix Developers Network, see
http://www.affymetrix.com/community/forums/thread.jspa?threadID=6481

/Henrik


On Mon, Jun 30, 2008 at 11:49 AM, Henrik Bengtsson <hb at stat.berkeley.edu> wrote:
> Hi,
>
> I can confirm that the probe sequence file for Mapping10K_Xba142
> [http://www.affymetrix.com/Auth/analysis/downloads/data/Mapping10Kv2_probe_tab.zip]
> linked to at the 'Mapping 10K 2.0 Array - Support Materials' page
> [http://www.affymetrix.com/support/technical/byproduct.affx?product=10k-20]
> does indeed look like it is for Mapping10K_Xba131: the available
> X and Y positions are in [1,710] and [1,707], which is clearly outside
> the 658x658 dimension of the Mapping10K_Xba142 chip type.
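
A range check of this kind can be sketched in R as follows. This is only an illustration: it assumes 0-based (x, y) coordinates, and the data frame `probes` is a made-up stand-in for the probe sequence file (only the cells (709, 13) and (51, 14) are taken from this thread).

```r
# Chip dimensions reported for Mapping10K_Xba142 in this thread
ncols <- 658
nrows <- 658

# Toy stand-in for the (x, y) columns of a probe sequence file.
# Only (51, 14) and (709, 13) come from the thread; the rest are made up.
probes <- data.frame(x = c(51, 709, 0, 657),
                     y = c(14,  13, 0, 657))

# Assuming 0-based coordinates, valid cells satisfy
# 0 <= x < ncols and 0 <= y < nrows
out_of_bounds <- probes$x < 0 | probes$x >= ncols |
                 probes$y < 0 | probes$y >= nrows

# Rows that cannot belong to a chip of this size
probes[out_of_bounds, ]

# Observed coordinate ranges, for comparison with the chip dimensions
range(probes$x)
range(probes$y)
```

Any row flagged here points to a sequence file that does not belong to the chip type described by the CDF.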
>
> Did you post this in the Affymetrix Forum
>
>  https://www.affymetrix.com/community/forums/index.jspa
>
> or directly to the support?  Is there a thread where I can post a follow up?
>
> -Henrik
>
>
> On Thu, Jun 26, 2008 at 2:25 PM, Michael Gormley
> <michael.gormley at gmail.com> wrote:
>> This is the same source where I obtained the files originally.  I have
>> brought this issue to the attention of affy technical support.  Hoping they
>> can get me the correct probe sequence file.
>>
>> On Thu, Jun 26, 2008 at 2:26 PM, James W. MacDonald <jmacdon at med.umich.edu>
>> wrote:
>>>
>>> Interesting.
>>>
>>> To test the problems Michael was having, I simply went to Affy's product
>>> support page and downloaded the library file, annotation file, and sequence
>>> file. So it appears they have things mixed up on that page, and there isn't
>>> anything obvious about the sequence file that would inform anybody it is
>>> wrong:
>>>
>>> > dir(pattern = "^Mapping")
>>> [1] "Mapping10K_probe_tab"             "Mapping10K_Xba142.CDF"
>>> [3] "Mapping10K_Xba142.na25.annot.csv"
>>>
>>> Best,
>>>
>>> Jim
>>>
>>>
>>>
>>> Henrik Bengtsson wrote:
>>>>
>>>> Note that there are two different Affymetrix 10K chip types, namely
>>>> Mapping10K_Xba131 (aka 'Mapping 10K Array') and Mapping10K_Xba142 (aka
>>>> 'Mapping 10K Array 2.0').  The probe sequence file you refer to seems
>>>> to be for the former, which is a larger chip.  Details on the official
>>>> Affymetrix CDFs (converted to binary though):
>>>>
>>>>> library(aroma.affymetrix)
>>>>> cdf <- AffymetrixCdfFile$byChipType("Mapping10K_Xba142")
>>>>> cdf
>>>>
>>>> AffymetrixCdfFile:
>>>> Path: annotationData/chipTypes/Mapping10K_Xba142
>>>> Filename: Mapping10K_Xba142.cdf
>>>> Filesize: 9.53MB
>>>> Chip type: Mapping10K_Xba142
>>>> RAM: 0.00MB
>>>> File format: v4 (binary; XDA)
>>>> Dimension: 658x658
>>>> Number of cells: 432964
>>>> Number of units: 10208
>>>> Cells per unit: 42.41
>>>> Number of QC units: 9
>>>>
>>>>> cdf <- AffymetrixCdfFile$byChipType("Mapping10K_Xba131")
>>>>> cdf
>>>>
>>>> AffymetrixCdfFile:
>>>> Path: annotationData/chipTypes/Mapping10K_Xba131
>>>> Filename: Mapping10K_Xba131.cdf
>>>> Filesize: 10.79MB
>>>> Chip type: Mapping10K_Xba131
>>>> RAM: 0.00MB
>>>> File format: v4 (binary; XDA)
>>>> Dimension: 712x712
>>>> Number of cells: 506944
>>>> Number of units: 11564
>>>> Cells per unit: 43.84
>>>> Number of QC units: 9
>>>>
>>>> FYI: I try to collect information about various Affymetrix chip types at:
>>>>
>>>>
>>>>  http://groups.google.com/group/aroma-affymetrix/web/documentation-on-chip-types
>>>>
>>>> Final comment: I would like to emphasize the difference between 'chip
>>>> type' and 'CDF'; a chip type refers to a unique product coming out of
>>>> Affymetrix, whereas a CDF refers to an annotation of a chip type.
>>>> There can be many different CDFs for each chip type, but only one chip
>>>> type per CDF.
>>>>
>>>> Cheers
>>>>
>>>> Henrik
>>>>
>>>> On Thu, Jun 26, 2008 at 9:42 AM, James W. MacDonald
>>>> <jmacdon at med.umich.edu> wrote:
>>>>>
>>>>> Hi Michael,
>>>>>
>>>>> Michael Gormley wrote:
>>>>>>
>>>>>> I get an error when running the makePdInfoPackage function to make a
>>>>>> PdInfo package for the 10K mapping array.  The output from the function
>>>>>> reads:
>>>>>>
>>>>>>> makePdInfoPackage(pkg,destDir=".")
>>>>>>
>>>>>> Creating package in ./pd.mapping10k.xba142
>>>>>> loadUnitsByBatch took 22.86 sec
>>>>>> loadAffyCsv took 2.79 sec
>>>>>> Error in sqliteExecStatement(con, statement, bind.data) :
>>>>>>  RS-DBI driver: (RS_SQLite_exec: could not execute: PRIMARY KEY must be unique)
>>>>>> In addition: Warning messages:
>>>>>> 1: In is.na(v) : is.na() applied to non-(list or vector) of type 'NULL'
>>>>>> 2: In is.na(v) : is.na() applied to non-(list or vector) of type 'NULL'
>>>>>> 3: In is.na(v) : is.na() applied to non-(list or vector) of type 'NULL'
>>>>>> Timing stopped at: 0.36 0.01 0.44
>>>>>
>>>>> I have spent some time looking at this, and it appears that the problem
>>>>> is due to inconsistencies between the cdf and probe sequence files. As
>>>>> far as I can tell there are many probe locations ((x, y) coordinates) in
>>>>> the cdf that don't exist in the probe sequence file, and vice versa.
>>>>>
>>>>> The function loadAffySeqCsv() reads in a chunk of data from the probe
>>>>> sequence file, then matches the indices (computed from the (x, y)
>>>>> coordinates) of these data with the indices that were generated using
>>>>> the cdf data. In the first chunk of 1000 probesets, there are only 8223
>>>>> probes that match between the two data sources. I don't think this
>>>>> would normally be a problem, except for the fact that 1000 probesets
>>>>> from the sequence file should *exactly* line up with what we got from
>>>>> the cdf.
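
The matching step described above can be sketched with base R's match(). This is not the code inside loadAffySeqCsv(); the vector `cdf_fid` and the data frame `chunk` are made-up toy data, and only the index formula and the chip width of 658 come from this thread.

```r
ncols <- 658  # chip width according to the CDF

# Toy set of indices generated from the CDF (a real CDF lists far more)
cdf_fid <- c(11, 12, 9264, 20000)

# Toy chunk of probe coordinates as read from the sequence file
chunk <- data.frame(x = c(10, 11, 51, 300),
                    y = c( 0,  0, 14, 500))

# Same index formula as used in the thread
chunk$fid <- chunk$x + 1 + chunk$y * ncols

# Position of each sequence-file index in the CDF indices;
# NA means the probe has no counterpart in the CDF
pos <- match(chunk$fid, cdf_fid)

n_matched <- sum(!is.na(pos))
unmatched <- chunk[is.na(pos), ]

n_matched
unmatched
```

If the two files describe the same chip type, `unmatched` should be empty for every chunk.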
>>>>>
>>>>> But the real problem that arises is this:
>>>>>
>>>>> The computation of indices is based on the dimensions of the chip. If we
>>>>> query the cdf to find what the dimensions are we get this:
>>>>>
>>>>> readCdfHeader(cdfFile)
>>>>> $ncols
>>>>> [1] 658
>>>>>
>>>>> $nrows
>>>>> [1] 658
>>>>>
>>>>> So we compute the indices thus:
>>>>>
>>>>> index <- x + 1 + y * ncols
>>>>>
>>>>> This will give unique indices for all (x, y) coordinates on the chip,
>>>>> assuming we agree that the dimensions of the chip are 658 x 658.
>>>>> However, the sequence file doesn't agree:
>>>>>
>>>>> pmdf[pmdf$fid == 9264,]
>>>>>          fset.name   x  y offset                       seq tstrand type tallele  fid
>>>>> 7077 SNP_A-1507675 709 13      0 TGCCCTGAATGTTTCAGCACATCTA       r   PM       T 9264
>>>>>
>>>>> The above is one line from the first 1000 probesets. Note that the
>>>>> (x, y) coordinates are (709, 13)! When we calculate the index (fid) we
>>>>> get 9264. Unfortunately, if we use (51, 14) we also get 9264. Because
>>>>> the sequence file isn't playing by the rules, we end up with a total of
>>>>> 25 duplicate indices. Since the index values are the primary key for the
>>>>> table we are trying to populate, we get an error because you can't have
>>>>> duplicated primary keys.
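
The collision described above can be reproduced with a few lines of R. This is a minimal sketch: `xy2index()` is a hypothetical helper that simply wraps the formula quoted in this thread, and the 712-column comparison assumes the Mapping10K_Xba131 dimension listed earlier.

```r
# Hypothetical helper wrapping the formula from the thread:
#   index <- x + 1 + y * ncols
xy2index <- function(x, y, ncols) x + 1 + y * ncols

ncols <- 658  # chip width reported by readCdfHeader() for Mapping10K_Xba142

fid_a <- xy2index(709, 13, ncols)  # x = 709 lies outside a 658-wide chip
fid_b <- xy2index(51, 14, ncols)   # a valid cell one row further down
c(fid_a, fid_b)                    # both evaluate to 9264

# duplicated() flags the second occurrence, which is what would violate
# the PRIMARY KEY constraint on insert
duplicated(c(fid_a, fid_b))

# Assuming the 712-column width of Mapping10K_Xba131 instead, the same
# two cells receive distinct indices
xy2index(709, 13, 712) == xy2index(51, 14, 712)
```

Since 709 = 658 + 51, an x coordinate of 709 on a 658-wide chip wraps around to column 51 of the next row, which is why the two cells share one index.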
>>>>>
>>>>> So long story short, the sequence file for this chip is broken - the
>>>>> apparent maximum (x, y) coordinate is (710, 707), which is well beyond
>>>>> what the cdf claims. Or maybe the cdf is broken - I don't really know.
>>>>> The end result is that this will never work until Affy comes up with
>>>>> some consistent information for the chip.
>>>>>
>>>>> Best,
>>>>>
>>>>> Jim
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>> traceback()
>>>>>>
>>>>>> 12: .Call("RS_SQLite_exec", conId, statement, bind.data, PACKAGE =
>>>>>> .SQLitePkgName)
>>>>>> 11: sqliteExecStatement(con, statement, bind.data)
>>>>>> 10: sqliteQuickSQL(conn, statement, bind.data, ...)
>>>>>> 9: dbGetPreparedQuery(db, sql, bind.data = mmdf)
>>>>>> 8: dbGetPreparedQuery(db, sql, bind.data = mmdf)
>>>>>> 7: loadAffySeqCsv(db, csvSeqFile, cdfFile, batch_size = batch_size)
>>>>>> 6: eval(expr, envir, enclos)
>>>>>> 5: eval(expr, envir = loc.frame)
>>>>>> 4: ST(loadAffySeqCsv(db, csvSeqFile, cdfFile, batch_size = batch_size))
>>>>>> 3: buildPdInfoDb(object@cdfFile, object@csvAnnoFile, object@csvSeqFile,
>>>>>>      dbFilePath, seqMatFile, batch_size = batch_size, verbose = !quiet)
>>>>>> 2: makePdInfoPackage(pkg, destDir = ".")
>>>>>> 1: makePdInfoPackage(pkg, destDir = ".")
>>>>>>
>>>>>> I noticed a prior post that suggested that this may be due to entering
>>>>>> a record into a table with a Feature ID that is already in the table.
>>>>>> Is this the case?  Is there a work-around here?
>>>>>>
>>>>>> Thanks,
>>>>>> Mike Gormley
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioconductor mailing list
>>>>>> Bioconductor at stat.math.ethz.ch
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>> Search the archives:
>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>
>>>>> --
>>>>> James W. MacDonald, M.S.
>>>>> Biostatistician
>>>>> Affymetrix and cDNA Microarray Core
>>>>> University of Michigan Cancer Center
>>>>> 1500 E. Medical Center Drive
>>>>> 7410 CCGC
>>>>> Ann Arbor MI 48109
>>>>> 734-647-5623
>>>>>
>>>
>>
>>
>


