[BioC] pd.hugene.2.0.st missing normgene->exon mappings

Tue Jul 9 20:06:21 CEST 2013

Dear Jim,

I did not really follow the discussion so I may be wrong, but if you 
mean that there is a difference between the number of 'main' types, 
please note that number of 'main' for pgf, i.e 349012, corresponds to 
the number of 'main' in the probeset annotation file and not in the 
transcript annotation file.

But as I said, I may have misunderstood the problem.

I am mainly replying because at the beginning of this year I had long 
discussions with DevNet to make sure that the annotation files for the 
2.X arrays are correct, and in version na33.2 DevNet did correct 
everything what I have found.

Best regards,
Christian

On 7/9/13 7:13 PM, James W. MacDonald wrote:
> Hi Mark,
>
> Thanks for the heads-up. We already knew that Affy messed up the
> transcript and probeset annotation files (and had them fixed), but
> didn't think I needed to check the others. Famous last words, no?
>
>  > x <- readPgf("HuGene-2_0-st.pgf")
>  > table(x$probesetType)
>
>               control->affx   control->affx->bac_spike
>                          18                         18
>   control->affx->ercc_spike control->affx->polya_spike
>                          92                         39
>   control->bgp->antigenomic                       main
>                          23                     349012
>            normgene->intron                   reporter
>                        3575                         82
>
>  > y <- read.csv("HuGene-2_0-st-v1.na33.2.hg19.transcript.csv",
> comment.char = "#", stringsAsFactors=FALSE, header = TRUE)
>  > table(y$category)
>
>               control->affx   control->affx->bac_spike
>                          18                         18
>   control->affx->ercc-spike control->affx->polya_spike
>                          92                         39
>   control->bgp->antigenomic                       main
>                          23                      44629
>              normgene->exon           normgene->intron
>                        1626                       3575
>                    reporter                     rescue
>                          82                       3515
>
> I'll ping Affymetrix and see what they have to say.
>
> Best,
>
> Jim
>
>
>
> On 7/9/2013 3:29 AM, Mark Cowley wrote:
>> Dear Benilton, James&  Bioconductors,
>> Thanks for providing the platform design packages for
>> hugene/mogene/ragene 1.0/1.1/2.0/2.1 arrays.
>>
>> I use SQL to query these packages&  ultimately retain only 'main'
>> probes in my analysis. This works well for 1.0 and 1.1 packages, but
>> nor for 2.0 and 2.1 ST arrays. For 2.0 and 2.1 arrays, the
>> normgene->exon control probes are misclassified as 'main' probes.
>>
>> evidence: the HuGene-2_0-st-v1.na33.2.hg19.transcript.csvNetAffx csv
>> files lists 1626 normgene->exon probes, however the pg.hugene.2.0.st
>> package lists 0, and assigns these 1626 probes to the 'main' category:
>>
>> # probe types:
>> library(pd.hugene.2.0.st)
>> conn<- db(pd.hugene.2.0.st)
>> dbGetQuery(conn,"SELECT * from type_dict")
>>     type                   type_id
>> 1     1                      main
>> 2     2             control->affx
>> 3     3             control->chip
>> 4     4 control->bgp->antigenomic
>> 5     5     control->bgp->genomic
>> 6     6            normgene->exon
>> 7     7          normgene->intron
>> 8     8  rescue->FLmRNA->unmapped
>> 9     9  control->affx->bac_spike
>> 10   10            oligo_spike_in
>> 11   11           r1_bac_spike_at
>>
>> # probe counts for each of the probe categories:
>> dbGetQuery(conn,"SELECT type, count(*) from featureSet GROUP BY type")
>>    type count(*)
>> 1   NA     3728
>> 2    1   345497
>> 3    2       18
>> 4    4       23
>> 5    7     3575
>> 6    9       18
>>
>> NB: no type 6 probes.
>> I've tested all 12 ho/mo/ra gene 1.0,1.1,2.0,2.1 ST packages, and see
>> this issue for all 2.0 and 2.1 arrays (see below)
>>
>> Can these mappings please be updated?
>>
>> PS, there's a bunch of probes with type = NA in the database. I
>> haven't investigated these in any detail.
>>
>> cheers,
>> Mark
>> -----------------------------------------------------
>> Mark Cowley, PhD
>>
>> Genome Informatics Division&  the Centre for Clinical Genomics
>> The Kinghorn Cancer Centre, Garvan Institute of Medical Research,
>> Sydney, Australia
>> -----------------------------------------------------
>>
>> All 12 packages below:
>> pd.packages<- c(
>>    "pd.hugene.1.0.st.v1", "pd.hugene.1.1.st.v1", "pd.hugene.2.0.st",
>> "pd.hugene.2.1.st",
>>    "pd.mogene.1.0.st.v1", "pd.mogene.1.1.st.v1", "pd.mogene.2.0.st",
>> "pd.mogene.2.1.st",
>>    "pd.ragene.1.0.st.v1", "pd.ragene.1.1.st.v1", "pd.ragene.2.0.st",
>> "pd.ragene.2.1.st"
>> )
>>
>> a<- b<- list()
>> for(pd.pkg.name in pd.packages) {
>>    try({
>>      require(pd.pkg.name, character.only=TRUE) || stop("Can't load the
>> pd.package")
>>      conn<- db(get(pd.pkg.name))
>>      a[[pd.pkg.name]]<- dbGetQuery(conn,"SELECT type, count(*) from
>> featureSet GROUP BY type")
>>      b[[pd.pkg.name]]<- dbGetQuery(conn,"SELECT fsetid from featureSet
>> WHERE type = 6")[,1]
>>    })
>> }
>> dbGetQuery(conn,"SELECT * from type_dict")
>>
>>> a
>> $pd.hugene.1.0.st.v1
>>    type count(*)
>> 1   NA      227
>> 2    1   253002
>> 3    2       57
>> 4    4       45
>> 5    6     1195
>> 6    7     2904
>>
>> $pd.hugene.1.1.st.v1
>>    type count(*)
>> 1   NA      227
>> 2    1   253002
>> 3    2       57
>> 4    4       45
>> 5    6     1195
>> 6    7     2904
>>
>> $pd.hugene.2.0.st
>>    type count(*)
>> 1   NA     3728
>> 2    1   345497
>> 3    2       18
>> 4    4       23
>> 5    7     3575
>> 6    9       18
>>
>> $pd.hugene.2.1.st
>>    type count(*)
>> 1   NA     3728
>> 2    1   345497
>> 3    2       18
>> 4    4       23
>> 5    7     3575
>> 6    9       18
>>
>> $pd.mogene.1.0.st.v1
>>    type count(*)
>> 1   NA       86
>> 2    1   234878
>> 3    2       21
>> 4    4       45
>> 5    6     1324
>> 6    7     5222
>>
>> $pd.mogene.1.1.st.v1
>>    type count(*)
>> 1   NA       86
>> 2    1   234878
>> 3    2       21
>> 4    4       45
>> 5    6     1324
>> 6    7     5222
>>
>> $pd.mogene.2.0.st
>>    type count(*)
>> 1   NA      810
>> 2    1   263551
>> 3    2       18
>> 4    4       23
>> 5    7     5331
>> 6    9       18
>>
>> $pd.mogene.2.1.st
>>    type count(*)
>> 1   NA      810
>> 2    1   263551
>> 3    2       18
>> 4    4       23
>> 5    7     5331
>> 6    9       18
>>
>> $pd.ragene.1.0.st.v1
>>    type count(*)
>> 1   NA      254
>> 2    1   211195
>> 3    2       21
>> 4    4       45
>> 5    6      399
>> 6    7     1153
>>
>> $pd.ragene.1.1.st.v1
>>    type count(*)
>> 1   NA      254
>> 2    1   211195
>> 3    2       21
>> 4    4       45
>> 5    6      399
>> 6    7     1153
>>
>> $pd.ragene.2.0.st
>>    type count(*)
>> 1   NA     1071
>> 2    1   214018
>> 3    2       18
>> 4    4       23
>> 5    7     5083
>> 6    9       18
>>
>> $pd.ragene.2.1.st
>>    type count(*)
>> 1   NA     1071
>> 2    1   214018
>> 3    2       18
>> 4    4       23
>> 5    7     5083
>> 6    9       18
>>
>>> sapply(b,length)
>> pd.hugene.1.0.st.v1 pd.hugene.1.1.st.v1    pd.hugene.2.0.st
>> pd.hugene.2.1.st
>>                 1195                1195
>> 0                   0
>> pd.mogene.1.0.st.v1 pd.mogene.1.1.st.v1    pd.mogene.2.0.st
>> pd.mogene.2.1.st
>>                 1324                1324
>> 0                   0
>> pd.ragene.1.0.st.v1 pd.ragene.1.1.st.v1    pd.ragene.2.0.st
>> pd.ragene.2.1.st
>>                  399                 399
>> 0                   0
>>
>>> sessionInfo()
>> R version 3.0.0 (2013-04-03)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>>   [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C
>>   [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8
>>   [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8
>>   [7] LC_PAPER=C                 LC_NAME=C
>>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] parallel  stats     graphics  grDevices utils     datasets  methods
>> [8] base
>>
>> other attached packages:
>>   [1] pd.ragene.2.1.st_2.12.1   pd.ragene.2.0.st_2.12.0
>>   [3] pd.ragene.1.1.st.v1_3.8.0 pd.ragene.1.0.st.v1_3.8.0
>>   [5] pd.mogene.2.1.st_2.12.1   pd.mogene.2.0.st_2.12.0
>>   [7] pd.mogene.1.1.st.v1_3.8.0 pd.mogene.1.0.st.v1_3.8.0
>>   [9] pd.hugene.2.1.st_3.8.0    pd.hugene.1.1.st.v1_3.8.0
>> [11] pd.hugene.1.0.st.v1_3.8.0 pd.hugene.2.0.st_3.8.0
>> [13] oligo_1.24.0              Biobase_2.20.0
>> [15] oligoClasses_1.22.0       BiocGenerics_0.6.0
>> [17] RSQLite_0.11.4            DBI_0.2-7
>> [19] BiocInstaller_1.10.2
>>
>> loaded via a namespace (and not attached):
>>   [1] affxparser_1.32.1     affyio_1.28.0         Biostrings_2.28.0
>>   [4] bit_1.1-10            codetools_0.2-8       ff_2.2-11
>>   [7] foreach_1.4.1         GenomicRanges_1.12.3  IRanges_1.18.1
>> [10] iterators_1.0.6       preprocessCore_1.22.0 splines_3.0.0
>> [13] stats4_3.0.0          tools_3.0.0           zlibbioc_1.6.0
>>
>>
>>
>>
>>
>>     [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>