[BioC] Pd info package affy 10K array

Thu Jun 26 18:42:34 CEST 2008

Hi Michael,

Michael Gormley wrote:
> I get an error when running the makePdInfoPackage function to make a PdInfo
> package for the 10K mapping array.  The output from the function reads:
> 
>> makePdInfoPackage(pkg,destDir=".")
> Creating package in ./pd.mapping10k.xba142
> loadUnitsByBatch took 22.86 sec
> loadAffyCsv took 2.79 sec
> Error in sqliteExecStatement(con, statement, bind.data) :
>   RS-DBI driver: (RS_SQLite_exec: could not execute: PRIMARY KEY must be
> unique)
> In addition: Warning messages:
> 1: In is.na(v) : is.na() applied to non-(list or vector) of type 'NULL'
> 2: In is.na(v) : is.na() applied to non-(list or vector) of type 'NULL'
> 3: In is.na(v) : is.na() applied to non-(list or vector) of type 'NULL'
> Timing stopped at: 0.36 0.01 0.44

I have spent some time looking at this, and it appears that the problem 
is due to inconsistencies between the cdf and probe sequence files. As 
far as I can tell there are many probe locations ((x, y) coordinates) in 
the cdf that don't exist in the probe sequence file, and vice versa.

The function loadAffySeqCsv() reads in a chunk of data from the probe 
sequence file, then matches the indices (computed from the (x, y) 
coordinates) of these data with the indices that were generated using 
the cdf data. In the first chunk of 1000 probesets, there are only 8223 
probes that match between the two data sources. I don't think this 
would normally be a problem, except for the fact that 1000 probesets 
from the sequence file should *exactly* line up with what we got from 
the cdf.
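
As a rough sketch of that kind of check (this is not the actual code of 
loadAffySeqCsv(), and the index values below are made up for 
illustration), the two sets of indices can be compared like this:

```r
# Illustrative index vectors; the values are invented, not read from the files.
cdf_idx <- c(1, 2, 3, 4, 5, 6)    # indices computed from the cdf
seq_idx <- c(2, 3, 4, 99, 100)    # indices computed from the sequence file

in_both <- seq_idx %in% cdf_idx

sum(in_both)                             # sequence-file probes also found in the cdf
only_in_seq <- seq_idx[!in_both]         # sequence-file probes with no cdf counterpart
only_in_cdf <- setdiff(cdf_idx, seq_idx) # cdf locations missing from the sequence file
```

If the two files agreed, both only_in_seq and only_in_cdf would be empty 
for a given chunk.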

But the real problem that arises is this:

The computation of indices is based on the dimensions of the chip. If we 
query the cdf to find what the dimensions are we get this:

readCdfHeader(cdfFile)
$ncols
[1] 658

$nrows
[1] 658

So we compute the indices thus:

index <- x + 1 + y * ncols

This will give unique indices for all (x, y) coordinates on the chip, 
assuming we agree that the dimensions of the chip are 658 x 658. 
However, the sequence file doesn't agree:

pmdf[pmdf$fid == 9264,]
          fset.name   x  y offset                       seq tstrand type tallele
7077 SNP_A-1507675 709 13      0 TGCCCTGAATGTTTCAGCACATCTA       r   PM       T
      fid
7077 9264

The above is one line from the first 1000 probesets. Note that the (x, 
y) coordinates are (709, 13)! When we calculate the index (fid) we get 
9264. Unfortunately, if we use (51, 14) we also get 9264. Because the 
sequence file isn't playing by the rules, we end up with a total of 25 
duplicate indices. Since the index values are the primary key for the 
table we are trying to populate we get an error because you can't have 
duplicated primary keys.
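
The collision can be reproduced with a few lines of R. This is only a 
sketch: the helper name xy2index is hypothetical (it is not a function 
from the package), and ncols is the value reported by readCdfHeader() 
above.

```r
# Hypothetical helper, not part of any package: the index formula from above.
xy2index <- function(x, y, ncols) x + 1 + y * ncols

ncols <- 658  # chip width according to the cdf header

idx_seq   <- xy2index(709, 13, ncols)  # location taken from the sequence file
idx_other <- xy2index(51, 14, ncols)   # a location that fits within 658 columns

idx_seq == idx_other   # TRUE, both evaluate to 9264

# An x value at or beyond ncols wraps around into the next row:
709 - ncols            # 51, i.e. (709, 13) lands on (51, 14)
```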

So long story short, the sequence file for this chip is broken - the 
apparent maximum (x, y) coordinate is (710, 707) which is well beyond 
what the cdf claims. Or maybe the cdf is broken - I don't really know. 
The end result is that this will never work until Affy comes up with 
some consistent information for the chip.
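
One way to look for this kind of inconsistency is sketched below. It 
uses a toy data frame rather than the real sequence file: the first row 
is the probe shown above, and the other two rows are illustrative only. 
It also assumes 0-based coordinates, as the "+ 1" in the index formula 
suggests.

```r
# Toy stand-in for the sequence-file data. Only row 1 comes from the
# output shown earlier; rows 2 and 3 are for illustration.
pmdf_toy <- data.frame(x = c(709, 51, 10), y = c(13, 14, 0))

ncols <- 658  # from the cdf header
nrows <- 658

# Coordinates that fall outside the dimensions claimed by the cdf
# (assuming valid x is 0..ncols-1 and valid y is 0..nrows-1)
out_of_range <- pmdf_toy$x >= ncols | pmdf_toy$y >= nrows

# Indices computed with the same formula, and the rows that collide
fid <- pmdf_toy$x + 1 + pmdf_toy$y * ncols
dup <- duplicated(fid) | duplicated(fid, fromLast = TRUE)

pmdf_toy[out_of_range, ]  # rows the cdf says cannot exist
pmdf_toy[dup, ]           # rows that would violate the primary key
```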

Best,

Jim

> 
>> traceback()
> 12: .Call("RS_SQLite_exec", conId, statement, bind.data, PACKAGE =
> .SQLitePkgName)
> 11: sqliteExecStatement(con, statement, bind.data)
> 10: sqliteQuickSQL(conn, statement, bind.data, ...)
> 9: dbGetPreparedQuery(db, sql, bind.data = mmdf)
> 8: dbGetPreparedQuery(db, sql, bind.data = mmdf)
> 7: loadAffySeqCsv(db, csvSeqFile, cdfFile, batch_size = batch_size)
> 6: eval(expr, envir, enclos)
> 5: eval(expr, envir = loc.frame)
> 4: ST(loadAffySeqCsv(db, csvSeqFile, cdfFile, batch_size = batch_size))
> 3: buildPdInfoDb(object@cdfFile, object@csvAnnoFile, object@csvSeqFile,
>        dbFilePath, seqMatFile, batch_size = batch_size, verbose = !quiet)
> 2: makePdInfoPackage(pkg, destDir = ".")
> 1: makePdInfoPackage(pkg, destDir = ".")
> 
> I noticed a prior post that suggested that this may be due to entering a
> record into a table with a Feature ID that is already in the table.  Is this
> the case?  Is there a work-around here?
> 
> Thanks,
> Mike Gormley

-- 
James W. MacDonald, M.S.
Biostatistician
Affymetrix and cDNA Microarray Core
University of Michigan Cancer Center
1500 E. Medical Center Drive
7410 CCGC
Ann Arbor MI 48109
734-647-5623