[BioC] Pd info package affy 10K array
James W. MacDonald
jmacdon at med.umich.edu
Thu Jun 26 18:42:34 CEST 2008
Hi Michael,
Michael Gormley wrote:
> I get an error when running the makePdInfoPackage function to make a PdInfo
> package for the 10K mapping array. The output from the function reads:
>
>> makePdInfoPackage(pkg,destDir=".")
> Creating package in ./pd.mapping10k.xba142
> loadUnitsByBatch took 22.86 sec
> loadAffyCsv took 2.79 sec
> Error in sqliteExecStatement(con, statement, bind.data) :
> RS-DBI driver: (RS_SQLite_exec: could not execute: PRIMARY KEY must be
> unique)
> In addition: Warning messages:
> 1: In is.na(v) : is.na() applied to non-(list or vector) of type 'NULL'
> 2: In is.na(v) : is.na() applied to non-(list or vector) of type 'NULL'
> 3: In is.na(v) : is.na() applied to non-(list or vector) of type 'NULL'
> Timing stopped at: 0.36 0.01 0.44
I have spent some time looking at this, and it appears that the problem
is due to inconsistencies between the cdf and probe sequence files. As
far as I can tell there are many probe locations ((x, y) coordinates) in
the cdf that don't exist in the probe sequence file, and vice versa.
The function loadAffySeqCsv() reads in a chunk of data from the probe
sequence file, then matches the indices (computed from the (x, y)
coordinates) of these data with the indices that were generated using
the cdf data. In the first chunk of 1000 probesets, there are only 8223
probesets that match between the two data sources. A partial match
would not normally be a problem in itself, but the 1000 probesets from
the sequence file should line up *exactly* with what we got from the
cdf.
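To make the kind of comparison described above concrete, here is a minimal base-R sketch. It is not the actual loadAffySeqCsv() code; the index vectors are toy values made up for illustration, and the variable names are hypothetical.

```r
# Toy indices, made up for illustration only (not from the real files)
cdf_idx <- c(1, 2, 3, 5, 8)   # indices derived from the cdf
seq_idx <- c(2, 3, 4, 8, 9)   # indices derived from the sequence file

# Which sequence-file indices also occur in the cdf?
matched <- seq_idx %in% cdf_idx

n_matched <- sum(matched)                 # rows with a partner in the cdf
only_seq  <- seq_idx[!matched]            # in the sequence file, not the cdf
only_cdf  <- setdiff(cdf_idx, seq_idx)    # in the cdf, not the sequence file
```

If the two files were consistent, only_seq and only_cdf would both be empty for a given chunk.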
But the real problem that arises is this:
The computation of indices is based on the dimensions of the chip. If we
query the cdf to find what the dimensions are we get this:
readCdfHeader(cdfFile)
$ncols
[1] 658
$nrows
[1] 658
So we compute the indices thus:
index <- x + 1 + y * ncols
This will give unique indices for all (x, y) coordinates on the chip,
assuming we agree that the dimensions of the chip are 658 x 658.
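As a quick sanity check on that formula, here is a minimal base-R sketch. The helper name xy2index is made up for illustration (it is not a function from any package), and it assumes the (x, y) coordinates are 0-based, which is what the "+ 1" in the formula suggests.

```r
# Hypothetical helper, not part of any package: map 0-based (x, y)
# chip coordinates to a 1-based feature index.
xy2index <- function(x, y, ncols) x + 1 + y * ncols

ncols <- 658
nrows <- 658

# Every coordinate pair that actually lies on a 658 x 658 chip
grid <- expand.grid(x = 0:(ncols - 1), y = 0:(nrows - 1))
idx <- xy2index(grid$x, grid$y, ncols)

# TRUE if no two on-chip coordinates share an index
all_unique <- anyDuplicated(idx) == 0

# Smallest and largest index produced
idx_range <- range(idx)
```

Uniqueness only holds as long as every x stays below ncols; an x of 658 or more wraps around into the next row's index range.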
However, the sequence file doesn't agree:
pmdf[pmdf$fid == 9264,]
         fset.name   x  y offset                       seq tstrand type tallele  fid
7077 SNP_A-1507675 709 13      0 TGCCCTGAATGTTTCAGCACATCTA       r   PM       T 9264
The above is one line from the first 1000 probesets. Note that the (x,
y) coordinates are (709, 13)! When we calculate the index (fid) we get
9264. Unfortunately, if we use (51, 14) we also get 9264. Because the
sequence file isn't playing by the rules, we end up with a total of 25
duplicate indices. Since the index values are the primary key for the
table we are trying to populate we get an error because you can't have
duplicated primary keys.
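The collision is easy to reproduce in base R. This is only a sketch: the first two coordinate pairs are the ones mentioned above, the third row is a made-up in-range probe added for contrast, and the data frame name coords is hypothetical.

```r
ncols <- 658  # chip width according to the cdf header

# Rows 1 and 2 are the coordinates discussed above; row 3 is a toy value
coords <- data.frame(x = c(709, 51, 10), y = c(13, 14, 0))

# Same index formula as above
coords$fid <- coords$x + 1 + coords$y * ncols

# Index values that occur more than once
dup_fids <- unique(coords$fid[duplicated(coords$fid)])

# All rows that share a duplicated index
clashes <- coords[coords$fid %in% dup_fids, ]

# Rows whose x coordinate lies outside what the cdf says the chip is
out_of_range <- coords[coords$x >= ncols, ]
```

Here both (709, 13) and (51, 14) end up with fid 9264, and only the (709, 13) row is flagged as out of range. A check along these lines on the full sequence file would be one way to list the offending probes before trying to build the package.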
So long story short, the sequence file for this chip is broken - the
apparent maximum (x, y) coordinate is (710, 707) which is well beyond
what the cdf claims. Or maybe the cdf is broken - I don't really know.
The end result is that this will never work until Affy comes up with
some consistent information for the chip.
Best,
Jim
>
>> traceback()
> 12: .Call("RS_SQLite_exec", conId, statement, bind.data, PACKAGE =
> .SQLitePkgName)
> 11: sqliteExecStatement(con, statement, bind.data)
> 10: sqliteQuickSQL(conn, statement, bind.data, ...)
> 9: dbGetPreparedQuery(db, sql, bind.data = mmdf)
> 8: dbGetPreparedQuery(db, sql, bind.data = mmdf)
> 7: loadAffySeqCsv(db, csvSeqFile, cdfFile, batch_size = batch_size)
> 6: eval(expr, envir, enclos)
> 5: eval(expr, envir = loc.frame)
> 4: ST(loadAffySeqCsv(db, csvSeqFile, cdfFile, batch_size = batch_size))
> 3: buildPdInfoDb(object@cdfFile, object@csvAnnoFile, object@csvSeqFile,
> dbFilePath, seqMatFile, batch_size = batch_size, verbose = !quiet)
> 2: makePdInfoPackage(pkg, destDir = ".")
> 1: makePdInfoPackage(pkg, destDir = ".")
>
> I noticed a prior post that suggested that this may be due to entering a
> record into a table with a Feature ID that is already in the table. Is this
> the case? Is there a work-around here?
>
> Thanks,
> Mike Gormley
>
--
James W. MacDonald, M.S.
Biostatistician
Affymetrix and cDNA Microarray Core
University of Michigan Cancer Center
1500 E. Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
734-647-5623