[BioC] Pd info package affy 10K array
Henrik Bengtsson
hb at stat.berkeley.edu
Thu Jun 26 20:09:04 CEST 2008
Note that there are two different Affymetrix 10K chip types, namely
Mapping10K_Xba131 (aka 'Mapping 10K Array') and Mapping10K_Xba142 (aka
'Mapping 10K Array 2.0'). The probe sequence file you refer to seems
to be for the former, which is a larger chip. Details on the official
Affymetrix CDFs (converted to binary though):
> library(aroma.affymetrix)
> cdf <- AffymetrixCdfFile$byChipType("Mapping10K_Xba142")
> cdf
AffymetrixCdfFile:
Path: annotationData/chipTypes/Mapping10K_Xba142
Filename: Mapping10K_Xba142.cdf
Filesize: 9.53MB
Chip type: Mapping10K_Xba142
RAM: 0.00MB
File format: v4 (binary; XDA)
Dimension: 658x658
Number of cells: 432964
Number of units: 10208
Cells per unit: 42.41
Number of QC units: 9
> cdf <- AffymetrixCdfFile$byChipType("Mapping10K_Xba131")
> cdf
AffymetrixCdfFile:
Path: annotationData/chipTypes/Mapping10K_Xba131
Filename: Mapping10K_Xba131.cdf
Filesize: 10.79MB
Chip type: Mapping10K_Xba131
RAM: 0.00MB
File format: v4 (binary; XDA)
Dimension: 712x712
Number of cells: 506944
Number of units: 11564
Cells per unit: 43.84
Number of QC units: 9
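A minimal sketch of why the two chip dimensions matter, assuming the
index formula quoted further down in this thread
(index <- x + 1 + y * ncols) and the two coordinates discussed there.
The helper name cellIndex() is hypothetical and not part of any
package, and whether the sequence file really belongs to
Mapping10K_Xba131 is only my guess from the dimensions above:

```r
# Hypothetical helper (not from any package): 1-based cell index from
# 0-based (x, y) coordinates, using the formula quoted in this thread.
cellIndex <- function(x, y, ncols) x + 1 + y * ncols

# The two coordinates discussed further down in the thread
xy <- data.frame(x = c(709, 51), y = c(13, 14))

# Number of columns as reported by the two CDFs above
idx142 <- cellIndex(xy$x, xy$y, ncols = 658)  # Mapping10K_Xba142
idx131 <- cellIndex(xy$x, xy$y, ncols = 712)  # Mapping10K_Xba131

# With 658 columns both coordinates collapse onto index 9264;
# with 712 columns they get distinct indices.
dup142 <- anyDuplicated(idx142) > 0
dup131 <- anyDuplicated(idx131) > 0

# Sanity check: a 0-based x coordinate must be smaller than ncols,
# so x = 709 cannot exist on a 658-column chip.
inRange142 <- xy$x < 658
inRange131 <- xy$x < 712
```

If the same range check is run on all (x, y) pairs of a probe sequence
file, any out-of-range coordinate suggests that the file and the CDF
describe different chip types.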
FYI: I try to collect information about various Affymetrix chip types at:
http://groups.google.com/group/aroma-affymetrix/web/documentation-on-chip-types
Final comment: I would like to emphasize the difference between 'chip
type' and 'CDF'; a chip type refers to a unique product coming out of
Affymetrix, whereas a CDF refers to an annotation of a chip type.
There can be many different CDFs for each chip type, but only one chip
type per CDF.
Cheers
Henrik
On Thu, Jun 26, 2008 at 9:42 AM, James W. MacDonald
<jmacdon at med.umich.edu> wrote:
> Hi Michael,
>
> Michael Gormley wrote:
>>
>> I get an error when running the makePdInfoPackage function to make a PdInfo
>> package for the 10K mapping array. The output from the function reads:
>>
>>> makePdInfoPackage(pkg,destDir=".")
>>
>> Creating package in ./pd.mapping10k.xba142
>> loadUnitsByBatch took 22.86 sec
>> loadAffyCsv took 2.79 sec
>> Error in sqliteExecStatement(con, statement, bind.data) :
>> RS-DBI driver: (RS_SQLite_exec: could not execute: PRIMARY KEY must be unique)
>> In addition: Warning messages:
>> 1: In is.na(v) : is.na() applied to non-(list or vector) of type 'NULL'
>> 2: In is.na(v) : is.na() applied to non-(list or vector) of type 'NULL'
>> 3: In is.na(v) : is.na() applied to non-(list or vector) of type 'NULL'
>> Timing stopped at: 0.36 0.01 0.44
>
> I have spent some time looking at this, and it appears that the problem is
> due to inconsistencies between the cdf and probe sequence files. As far as I
> can tell there are many probe locations ((x, y) coordinates) in the cdf that
> don't exist in the probe sequence file, and vice versa.
>
> The function loadAffySeqCsv() reads in a chunk of data from the probe
> sequence file, then matches the indices (computed from the (x, y)
> coordinates) of these data with the indices that were generated using the
> cdf data. In the first chunk of 1000 probesets, there are only 8223
> probes that match between the two data sources. I don't think this would
> normally be a problem, except for the fact that 1000 probesets from the
> sequence file should *exactly* line up with what we got from the cdf.
>
> But the real problem that arises is this:
>
> The computation of indices is based on the dimensions of the chip. If we
> query the cdf to find what the dimensions are we get this:
>
> readCdfHeader(cdfFile)
> $ncols
> [1] 658
>
> $nrows
> [1] 658
>
> So we compute the indices thus:
>
> index <- x + 1 + y * ncols
>
> This will give unique indices for all (x, y) coordinates on the chip,
> assuming we agree that the dimensions of the chip are 658 x 658. However,
> the sequence file doesn't agree:
>
> pmdf[pmdf$fid == 9264,]
>          fset.name   x  y offset                       seq tstrand type tallele  fid
> 7077 SNP_A-1507675 709 13      0 TGCCCTGAATGTTTCAGCACATCTA       r   PM       T 9264
>
> The above is one line from the first 1000 probesets. Note that the (x, y)
> coordinates are (709, 13)! When we calculate the index (fid) we get 9264.
> Unfortunately, if we use (51, 14) we also get 9264. Because the sequence
> file isn't playing by the rules, we end up with a total of 25 duplicate
> indices. Since the index values are the primary key for the table we are
> trying to populate we get an error because you can't have duplicated primary
> keys.
>
> So long story short, the sequence file for this chip is broken - the
> apparent maximum (x, y) coordinate is (710, 707) which is well beyond what
> the cdf claims. Or maybe the cdf is broken - I don't really know. The end
> result is that this will never work until Affy comes up with some consistent
> information for the chip.
>
> Best,
>
> Jim
>
>
>
>
>>
>>> traceback()
>>
>> 12: .Call("RS_SQLite_exec", conId, statement, bind.data, PACKAGE =
>> .SQLitePkgName)
>> 11: sqliteExecStatement(con, statement, bind.data)
>> 10: sqliteQuickSQL(conn, statement, bind.data, ...)
>> 9: dbGetPreparedQuery(db, sql, bind.data = mmdf)
>> 8: dbGetPreparedQuery(db, sql, bind.data = mmdf)
>> 7: loadAffySeqCsv(db, csvSeqFile, cdfFile, batch_size = batch_size)
>> 6: eval(expr, envir, enclos)
>> 5: eval(expr, envir = loc.frame)
>> 4: ST(loadAffySeqCsv(db, csvSeqFile, cdfFile, batch_size = batch_size))
>> 3: buildPdInfoDb(object@cdfFile, object@csvAnnoFile, object@csvSeqFile,
>> dbFilePath, seqMatFile, batch_size = batch_size, verbose = !quiet)
>> 2: makePdInfoPackage(pkg, destDir = ".")
>> 1: makePdInfoPackage(pkg, destDir = ".")
>>
>> I noticed a prior post that suggested that this may be due to entering a
>> record into a table with a Feature ID that is already in the table. Is this
>> the case? Is there a work-around here?
>>
>> Thanks,
>> Mike Gormley
>>
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> --
> James W. MacDonald, M.S.
> Biostatistician
> Affymetrix and cDNA Microarray Core
> University of Michigan Cancer Center
> 1500 E. Medical Center Drive
> 7410 CCGC
> Ann Arbor MI 48109
> 734-647-5623
>