[BioC] Pd info package affy 10K array
Henrik Bengtsson
hb at stat.berkeley.edu
Mon Jun 30 20:49:12 CEST 2008
Hi,
I can confirm that the probe sequence file for Mapping10K_Xba142
[http://www.affymetrix.com/Auth/analysis/downloads/data/Mapping10Kv2_probe_tab.zip]
linked to at the 'Mapping 10K 2.0 Array - Support Materials' page
[http://www.affymetrix.com/support/technical/byproduct.affx?product=10k-20]
does indeed look like it is for Mapping10K_Xba131; e.g., the available
X and Y positions are in [1,710] and [1,707], which clearly exceeds
the 658x658 dimensions of the Mapping10K_Xba142 chip type.
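
A range check of this kind can be sketched in a few lines of R. This is only an illustration: the data frame `probes`, its column names `x` and `y`, and the toy values are assumptions standing in for the real tab-delimited probe file, and the coordinates are assumed to be zero-based.

```r
# Toy stand-in for the probe sequence table; real data would come from
# reading the tab-delimited probe file (column names are assumed here)
probes <- data.frame(x = c(51, 709, 12), y = c(14, 13, 700))

ncols <- 658  # Mapping10K_Xba142 dimensions as reported by the CDF
nrows <- 658

# Flag probes whose coordinates cannot lie on a 658x658 chip
# (assumes zero-based coordinates, so valid values are 0..657)
off_chip <- probes$x >= ncols | probes$y >= nrows

probes[off_chip, ]  # rows that fall outside the chip
sum(off_chip)       # how many such rows there are
```

Any nonzero count here would suggest that the probe file and the CDF describe different chip layouts.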
Did you post this in the Affymetrix Forum
https://www.affymetrix.com/community/forums/index.jspa
or directly to support? Is there a thread where I can post a follow-up?
-Henrik
On Thu, Jun 26, 2008 at 2:25 PM, Michael Gormley
<michael.gormley at gmail.com> wrote:
> This is the same source where I obtained the files originally. I have
> brought this issue to the attention of affy technical support. Hoping they
> can get me the correct probe sequence file.
>
> On Thu, Jun 26, 2008 at 2:26 PM, James W. MacDonald <jmacdon at med.umich.edu>
> wrote:
>>
>> Interesting.
>>
>> To test the problems Michael was having, I simply went to Affy's product
>> support page and downloaded the library file, annotation file, and sequence
>> file. So it appears they have things mixed up on that page, and there isn't
>> anything obvious about the sequence file that would inform anybody it is
>> wrong:
>>
>> > dir(pattern = "^Mapping")
>> [1] "Mapping10K_probe_tab" "Mapping10K_Xba142.CDF"
>> [3] "Mapping10K_Xba142.na25.annot.csv"
>>
>> Best,
>>
>> Jim
>>
>>
>>
>> Henrik Bengtsson wrote:
>>>
>>> Note that there are two different Affymetrix 10K chip types, namely
>>> Mapping10K_Xba131 (aka 'Mapping 10K Array') and Mapping10K_Xba142 (aka
>>> 'Mapping 10K Array 2.0'). The probe sequence file you refer to seems
>>> to be for the former, which is a larger chip. Details on the official
>>> Affymetrix CDFs (converted to binary though):
>>>
>>>> library(aroma.affymetrix)
>>>> cdf <- AffymetrixCdfFile$byChipType("Mapping10K_Xba142")
>>>> cdf
>>>
>>> AffymetrixCdfFile:
>>> Path: annotationData/chipTypes/Mapping10K_Xba142
>>> Filename: Mapping10K_Xba142.cdf
>>> Filesize: 9.53MB
>>> Chip type: Mapping10K_Xba142
>>> RAM: 0.00MB
>>> File format: v4 (binary; XDA)
>>> Dimension: 658x658
>>> Number of cells: 432964
>>> Number of units: 10208
>>> Cells per unit: 42.41
>>> Number of QC units: 9
>>>
>>>> cdf <- AffymetrixCdfFile$byChipType("Mapping10K_Xba131")
>>>> cdf
>>>
>>> AffymetrixCdfFile:
>>> Path: annotationData/chipTypes/Mapping10K_Xba131
>>> Filename: Mapping10K_Xba131.cdf
>>> Filesize: 10.79MB
>>> Chip type: Mapping10K_Xba131
>>> RAM: 0.00MB
>>> File format: v4 (binary; XDA)
>>> Dimension: 712x712
>>> Number of cells: 506944
>>> Number of units: 11564
>>> Cells per unit: 43.84
>>> Number of QC units: 9
>>>
>>> FYI: I try to collect information about various Affymetrix chip types at:
>>>
>>>
>>> http://groups.google.com/group/aroma-affymetrix/web/documentation-on-chip-types
>>>
>>> Final comment: I would like to emphasize the difference between 'chip
>>> type' and 'CDF'; a chip type refers to a unique product coming out of
>>> Affymetrix, whereas a CDF refers to an annotation of a chip type.
>>> There can be many different CDFs for each chip type, but only one chip
>>> type per CDF.
>>>
>>> Cheers
>>>
>>> Henrik
>>>
>>> On Thu, Jun 26, 2008 at 9:42 AM, James W. MacDonald
>>> <jmacdon at med.umich.edu> wrote:
>>>>
>>>> Hi Michael,
>>>>
>>>> Michael Gormley wrote:
>>>>>
>>>>> I get an error when running the makePdInfoPackage function to make a
>>>>> PdInfo package for the 10K mapping array. The output from the function
>>>>> reads:
>>>>>
>>>>>> makePdInfoPackage(pkg,destDir=".")
>>>>>
>>>>> Creating package in ./pd.mapping10k.xba142
>>>>> loadUnitsByBatch took 22.86 sec
>>>>> loadAffyCsv took 2.79 sec
>>>>> Error in sqliteExecStatement(con, statement, bind.data) :
>>>>> RS-DBI driver: (RS_SQLite_exec: could not execute: PRIMARY KEY must be
>>>>> unique)
>>>>> In addition: Warning messages:
>>>>> 1: In is.na(v) : is.na() applied to non-(list or vector) of type 'NULL'
>>>>> 2: In is.na(v) : is.na() applied to non-(list or vector) of type 'NULL'
>>>>> 3: In is.na(v) : is.na() applied to non-(list or vector) of type 'NULL'
>>>>> Timing stopped at: 0.36 0.01 0.44
>>>>
>>>> I have spent some time looking at this, and it appears that the problem is
>>>> due to inconsistencies between the cdf and probe sequence files. As far as I
>>>> can tell there are many probe locations ((x, y) coordinates) in the cdf that
>>>> don't exist in the probe sequence file, and vice versa.
>>>>
>>>> The function loadAffySeqCsv() reads in a chunk of data from the probe
>>>> sequence file, then matches the indices (computed from the (x, y)
>>>> coordinates) of these data with the indices that were generated using the
>>>> cdf data. In the first chunk of 1000 probesets, there are only 8223
>>>> probes that match between the two data sources. I don't think this would
>>>> normally be a problem, except for the fact that 1000 probesets from the
>>>> sequence file should *exactly* line up with what we got from the cdf.
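
The matching step described above can be illustrated with base R. This is a rough sketch under stated assumptions, not the actual loadAffySeqCsv() code: the vectors `cdf_idx` and `seq_idx` and their values are made up for illustration.

```r
# Toy indices standing in for those derived from the CDF and from one
# chunk of the sequence file (values are illustrative only)
cdf_idx <- c(101, 102, 103, 104, 105)
seq_idx <- c(102, 104, 999, 105)

# Which sequence-file probes have a counterpart in the CDF?
matched <- seq_idx %in% cdf_idx

sum(matched)               # number of matching probes
seq_idx[!matched]          # indices present only in the sequence file
setdiff(cdf_idx, seq_idx)  # indices present only in the CDF
```

If the two files described the same chip, both "only in" sets would be expected to be empty for the probes in the chunk.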
>>>>
>>>> But the real problem that arises is this:
>>>>
>>>> The computation of indices is based on the dimensions of the chip. If we
>>>> query the cdf to find what the dimensions are we get this:
>>>>
>>>> readCdfHeader(cdfFile)
>>>> $ncols
>>>> [1] 658
>>>>
>>>> $nrows
>>>> [1] 658
>>>>
>>>> So we compute the indices thus:
>>>>
>>>> index <- x + 1 + y * ncols
>>>>
>>>> This will give unique indices for all (x, y) coordinates on the chip,
>>>> assuming we agree that the dimensions of the chip are 658 x 658. However,
>>>> the sequence file doesn't agree:
>>>>
>>>> pmdf[pmdf$fid == 9264,]
>>>>          fset.name   x  y offset                       seq tstrand type tallele  fid
>>>> 7077 SNP_A-1507675 709 13      0 TGCCCTGAATGTTTCAGCACATCTA       r   PM       T 9264
>>>>
>>>> The above is one line from the first 1000 probesets. Note that the (x, y)
>>>> coordinates are (709, 13)! When we calculate the index (fid) we get 9264.
>>>> Unfortunately, if we use (51, 14) we also get 9264. Because the sequence
>>>> file isn't playing by the rules, we end up with a total of 25 duplicate
>>>> indices. Since the index values are the primary key for the table we are
>>>> trying to populate we get an error because you can't have duplicated
>>>> primary keys.
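
The collision described above can be reproduced in base R. This is a minimal sketch: the helper names `xy_to_index` and `on_chip` are hypothetical (they are not part of any package mentioned in this thread) and simply restate the formula quoted above, assuming zero-based coordinates as the `+ 1` in that formula suggests.

```r
# Hypothetical helper restating the formula quoted above:
#   index <- x + 1 + y * ncols
xy_to_index <- function(x, y, ncols) {
  x + 1 + y * ncols
}

ncols <- 658  # chip width reported by the Mapping10K_Xba142 CDF
nrows <- 658

# A coordinate from the sequence file (x lies beyond the chip width)
# and a legitimate on-chip coordinate
idx_bad  <- xy_to_index(709, 13, ncols)
idx_good <- xy_to_index(51, 14, ncols)

c(idx_bad, idx_good)  # both evaluate to 9264
idx_bad == idx_good   # TRUE: two different cells share one index

# The mapping is only one-to-one while x stays within 0..(ncols - 1)
on_chip <- function(x, y, ncols, nrows) {
  x >= 0 & x < ncols & y >= 0 & y < nrows
}
on_chip(c(709, 51), c(13, 14), ncols, nrows)
```

Because 709 = 658 + 51, an x value that overshoots the chip width wraps into the next row, which is why (709, 13) and (51, 14) map to the same index.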
>>>>
>>>> So long story short, the sequence file for this chip is broken - the
>>>> apparent maximum (x, y) coordinate is (710, 707) which is well beyond what
>>>> the cdf claims. Or maybe the cdf is broken - I don't really know. The end
>>>> result is that this will never work until Affy comes up with some
>>>> consistent information for the chip.
>>>>
>>>> Best,
>>>>
>>>> Jim
>>>>
>>>>
>>>>
>>>>
>>>>>> traceback()
>>>>>
>>>>> 12: .Call("RS_SQLite_exec", conId, statement, bind.data, PACKAGE =
>>>>> .SQLitePkgName)
>>>>> 11: sqliteExecStatement(con, statement, bind.data)
>>>>> 10: sqliteQuickSQL(conn, statement, bind.data, ...)
>>>>> 9: dbGetPreparedQuery(db, sql, bind.data = mmdf)
>>>>> 8: dbGetPreparedQuery(db, sql, bind.data = mmdf)
>>>>> 7: loadAffySeqCsv(db, csvSeqFile, cdfFile, batch_size = batch_size)
>>>>> 6: eval(expr, envir, enclos)
>>>>> 5: eval(expr, envir = loc.frame)
>>>>> 4: ST(loadAffySeqCsv(db, csvSeqFile, cdfFile, batch_size = batch_size))
>>>>> 3: buildPdInfoDb(object@cdfFile, object@csvAnnoFile, object@csvSeqFile,
>>>>> dbFilePath, seqMatFile, batch_size = batch_size, verbose = !quiet)
>>>>> 2: makePdInfoPackage(pkg, destDir = ".")
>>>>> 1: makePdInfoPackage(pkg, destDir = ".")
>>>>>
>>>>> I noticed a prior post that suggested that this may be due to entering a
>>>>> record into a table with a Feature ID that is already in the table. Is this
>>>>> the case? Is there a work-around here?
>>>>>
>>>>> Thanks,
>>>>> Mike Gormley
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at stat.math.ethz.ch
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:
>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>> --
>>>> James W. MacDonald, M.S.
>>>> Biostatistician
>>>> Affymetrix and cDNA Microarray Core
>>>> University of Michigan Cancer Center
>>>> 1500 E. Medical Center Drive
>>>> 7410 CCGC
>>>> Ann Arbor MI 48109
>>>> 734-647-5623
>>>>
>>
>> --
>> James W. MacDonald, M.S.
>> Biostatistician
>> Affymetrix and cDNA Microarray Core
>> University of Michigan Cancer Center
>> 1500 E. Medical Center Drive
>> 7410 CCGC
>> Ann Arbor MI 48109
>> 734-647-5623
>
>