[BioC] pd.hugene.2.0.st missing normgene->exon mappings

cstrato cstrato at aon.at
Tue Jul 9 23:54:08 CEST 2013


Dear Jim,

As far as I understand it, at the transcript level 16934607 is on one 
hand part of the EIF3D transcript and on the other hand does serve as a 
positive control. To me this seems to be no contradiction, but probably 
only DevNet can explain.

Best regards,
Christian


On 7/9/13 11:46 PM, James W. MacDonald wrote:
> Hi Christian,
>
> Thanks for pointing that out. There is still a bit of an inconsistency
> with the pd packages that should probably be corrected, as at the
> probeset level e.g., 16934607 is intended to measure an exon of EIF3D,
> whereas at the transcript level, this same probeset is intended to be a
> positive control (and as you note below, these probes are incorporated
> into a larger probeset intended to measure the EIF3D transcript).
>
> It would be nice to be able to filter out the controls like Mark
> attempted (and I do regularly as well).
>
> Mark - I talked with Benilton Carvalho about this, and he will take a
> look next week.
>
> Best,
>
> Jim
>
>
>
> On 7/9/2013 3:38 PM, cstrato wrote:
>> Dear Jim,
>>
>> In xps I use as basic file for exon arrays the probeset annotation
>> file and then compare the data to the data from the pgf-file. Any
>> differences will be reported.
>>
>> I have just checked the different files for HuGene-2_0-st. If you
>> check as an example the following probeset_ids:
>> 16934607
>> 16934608
>> 16934609
>> 16934610
>>
>> Then you will see that the transcript annotation file lists these ids
>> as 'normgene->exon' and 'pos_control'. However, the probeset
>> annotation file lists these ids as 'main' belonging to gene EIF3D with
>> transcript_cluster_id 16934583. Looking for this id in the transcript
>> annotation file reveals that the number of 'total_probes' is 24.
>> Indeed, the probeset annotation file lists 24 probesets including the
>> four above mentioned probeset_ids.
>>
>> This means that although these four probesets are listed in the
>> transcript annotation file as 'normgene->exon' the label 'main' in the
>> pgf-file is correct since these probesets are part of the gene EIF3D.
>>
>> Interestingly, the pgf-file for HuGene-1_0-st has extra probesets
>> listed as 'normgene->exon'. However, in this case these probesets are
>> also listed as 'normgene->exon' in the probeset annotation file, i.e.
>> these probesets do not belong to any transcript listed in the probeset
>> annotation file.
>>
>> Best regards,
>> Christian
>>
>>
>> On 7/9/13 8:46 PM, James W. MacDonald wrote:
>>> Hi Christian,
>>>
>>> That's not the issue. Instead, the issue is that the pgf file lists the
>>> normgene->exon probeset IDs as being 'main'. I have received a response
>>> from Affy stating that the qcc file lists the normgene->exon probesets
>>> as pos_control, but that seems orthogonal to the issue at hand.
>>>
>>> > qcc <- read.table("HuGene-2_0-st.qcc", comment.char="#",
>>> stringsAsFactors=F, header=T)
>>> > pgf <- readPgf("HuGene-2_0-st.pgf")
>>> > head(qcc)
>>>    probeset_id  group_name probeset_name quantification_in_header
>>> 1    16650001 neg_control      16650001                        0
>>> 2    16650003 neg_control      16650003                        0
>>>
>>> ## get the positive controls (normgene->exon probesets) from the qcc
>>> file
>>> > pos_cont <- qcc[qcc[,2] == "pos_control",1]
>>>
>>> ## compare to pgf file
>>> > x <- pgf$probesetType[pgf$probesetId %in% pos_cont]
>>> > table(x)
>>> x
>>> main
>>> 1626
>>>
>>> So in the pgf file, these probesets are being called 'main' instead of
>>> some sort of control. How do you handle this in xps? Do you use the pgf
>>> file?
>>>
>>> Best,
>>>
>>> Jim
>>>
>>>
>>>
>>>
>>> On 7/9/2013 2:06 PM, cstrato wrote:
>>>> Dear Jim,
>>>>
>>>> I did not really follow the discussion so I may be wrong, but if you
>>>> mean that there is a difference between the number of 'main' types,
>>>> please note that number of 'main' for pgf, i.e 349012, corresponds to
>>>> the number of 'main' in the probeset annotation file and not in the
>>>> transcript annotation file.
>>>>
>>>> But as I said, I may have misunderstood the problem.
>>>>
>>>> I am mainly replying because at the beginning of this year I had long
>>>> discussions with DevNet to make sure that the annotation files for the
>>>> 2.X arrays are correct, and in version na33.2 DevNet did correct
>>>> everything what I have found.
>>>>
>>>> Best regards,
>>>> Christian
>>>>
>>>>
>>>> On 7/9/13 7:13 PM, James W. MacDonald wrote:
>>>>> Hi Mark,
>>>>>
>>>>> Thanks for the heads-up. We already knew that Affy messed up the
>>>>> transcript and probeset annotation files (and had them fixed), but
>>>>> didn't think I needed to check the others. Famous last words, no?
>>>>>
>>>>> > x <- readPgf("HuGene-2_0-st.pgf")
>>>>> > table(x$probesetType)
>>>>>
>>>>>               control->affx   control->affx->bac_spike
>>>>>                          18                         18
>>>>>   control->affx->ercc_spike control->affx->polya_spike
>>>>>                          92                         39
>>>>>   control->bgp->antigenomic                       main
>>>>>                          23                     349012
>>>>>            normgene->intron                   reporter
>>>>>                        3575                         82
>>>>>
>>>>> > y <- read.csv("HuGene-2_0-st-v1.na33.2.hg19.transcript.csv",
>>>>> comment.char = "#", stringsAsFactors=FALSE, header = TRUE)
>>>>> > table(y$category)
>>>>>
>>>>>               control->affx   control->affx->bac_spike
>>>>>                          18                         18
>>>>>   control->affx->ercc-spike control->affx->polya_spike
>>>>>                          92                         39
>>>>>   control->bgp->antigenomic                       main
>>>>>                          23                      44629
>>>>>              normgene->exon           normgene->intron
>>>>>                        1626                       3575
>>>>>                    reporter                     rescue
>>>>>                          82                       3515
>>>>>
>>>>> I'll ping Affymetrix and see what they have to say.
>>>>>
>>>>> Best,
>>>>>
>>>>> Jim
>>>>>
>>>>>
>>>>>
>>>>> On 7/9/2013 3:29 AM, Mark Cowley wrote:
>>>>>> Dear Benilton, James&  Bioconductors,
>>>>>> Thanks for providing the platform design packages for
>>>>>> hugene/mogene/ragene 1.0/1.1/2.0/2.1 arrays.
>>>>>>
>>>>>> I use SQL to query these packages&  ultimately retain only 'main'
>>>>>> probes in my analysis. This works well for 1.0 and 1.1 packages, but
>>>>>> nor for 2.0 and 2.1 ST arrays. For 2.0 and 2.1 arrays, the
>>>>>> normgene->exon control probes are misclassified as 'main' probes.
>>>>>>
>>>>>> evidence: the HuGene-2_0-st-v1.na33.2.hg19.transcript.csvNetAffx csv
>>>>>> files lists 1626 normgene->exon probes, however the pg.hugene.2.0.st
>>>>>> package lists 0, and assigns these 1626 probes to the 'main'
>>>>>> category:
>>>>>>
>>>>>> # probe types:
>>>>>> library(pd.hugene.2.0.st)
>>>>>> conn<- db(pd.hugene.2.0.st)
>>>>>> dbGetQuery(conn,"SELECT * from type_dict")
>>>>>>     type                   type_id
>>>>>> 1     1                      main
>>>>>> 2     2             control->affx
>>>>>> 3     3             control->chip
>>>>>> 4     4 control->bgp->antigenomic
>>>>>> 5     5     control->bgp->genomic
>>>>>> 6     6            normgene->exon
>>>>>> 7     7          normgene->intron
>>>>>> 8     8  rescue->FLmRNA->unmapped
>>>>>> 9     9  control->affx->bac_spike
>>>>>> 10   10            oligo_spike_in
>>>>>> 11   11           r1_bac_spike_at
>>>>>>
>>>>>> # probe counts for each of the probe categories:
>>>>>> dbGetQuery(conn,"SELECT type, count(*) from featureSet GROUP BY
>>>>>> type")
>>>>>>    type count(*)
>>>>>> 1   NA     3728
>>>>>> 2    1   345497
>>>>>> 3    2       18
>>>>>> 4    4       23
>>>>>> 5    7     3575
>>>>>> 6    9       18
>>>>>>
>>>>>> NB: no type 6 probes.
>>>>>> I've tested all 12 ho/mo/ra gene 1.0,1.1,2.0,2.1 ST packages, and see
>>>>>> this issue for all 2.0 and 2.1 arrays (see below)
>>>>>>
>>>>>> Can these mappings please be updated?
>>>>>>
>>>>>> PS, there's a bunch of probes with type = NA in the database. I
>>>>>> haven't investigated these in any detail.
>>>>>>
>>>>>> cheers,
>>>>>> Mark
>>>>>> -----------------------------------------------------
>>>>>> Mark Cowley, PhD
>>>>>>
>>>>>> Genome Informatics Division&  the Centre for Clinical Genomics
>>>>>> The Kinghorn Cancer Centre, Garvan Institute of Medical Research,
>>>>>> Sydney, Australia
>>>>>> -----------------------------------------------------
>>>>>>
>>>>>> All 12 packages below:
>>>>>> pd.packages<- c(
>>>>>>    "pd.hugene.1.0.st.v1", "pd.hugene.1.1.st.v1", "pd.hugene.2.0.st",
>>>>>> "pd.hugene.2.1.st",
>>>>>>    "pd.mogene.1.0.st.v1", "pd.mogene.1.1.st.v1", "pd.mogene.2.0.st",
>>>>>> "pd.mogene.2.1.st",
>>>>>>    "pd.ragene.1.0.st.v1", "pd.ragene.1.1.st.v1", "pd.ragene.2.0.st",
>>>>>> "pd.ragene.2.1.st"
>>>>>> )
>>>>>>
>>>>>> a<- b<- list()
>>>>>> for(pd.pkg.name in pd.packages) {
>>>>>>    try({
>>>>>>      require(pd.pkg.name, character.only=TRUE) || stop("Can't load
>>>>>> the
>>>>>> pd.package")
>>>>>>      conn<- db(get(pd.pkg.name))
>>>>>>      a[[pd.pkg.name]]<- dbGetQuery(conn,"SELECT type, count(*) from
>>>>>> featureSet GROUP BY type")
>>>>>>      b[[pd.pkg.name]]<- dbGetQuery(conn,"SELECT fsetid from
>>>>>> featureSet
>>>>>> WHERE type = 6")[,1]
>>>>>>    })
>>>>>> }
>>>>>> dbGetQuery(conn,"SELECT * from type_dict")
>>>>>>
>>>>>>> a
>>>>>> $pd.hugene.1.0.st.v1
>>>>>>    type count(*)
>>>>>> 1   NA      227
>>>>>> 2    1   253002
>>>>>> 3    2       57
>>>>>> 4    4       45
>>>>>> 5    6     1195
>>>>>> 6    7     2904
>>>>>>
>>>>>> $pd.hugene.1.1.st.v1
>>>>>>    type count(*)
>>>>>> 1   NA      227
>>>>>> 2    1   253002
>>>>>> 3    2       57
>>>>>> 4    4       45
>>>>>> 5    6     1195
>>>>>> 6    7     2904
>>>>>>
>>>>>> $pd.hugene.2.0.st
>>>>>>    type count(*)
>>>>>> 1   NA     3728
>>>>>> 2    1   345497
>>>>>> 3    2       18
>>>>>> 4    4       23
>>>>>> 5    7     3575
>>>>>> 6    9       18
>>>>>>
>>>>>> $pd.hugene.2.1.st
>>>>>>    type count(*)
>>>>>> 1   NA     3728
>>>>>> 2    1   345497
>>>>>> 3    2       18
>>>>>> 4    4       23
>>>>>> 5    7     3575
>>>>>> 6    9       18
>>>>>>
>>>>>> $pd.mogene.1.0.st.v1
>>>>>>    type count(*)
>>>>>> 1   NA       86
>>>>>> 2    1   234878
>>>>>> 3    2       21
>>>>>> 4    4       45
>>>>>> 5    6     1324
>>>>>> 6    7     5222
>>>>>>
>>>>>> $pd.mogene.1.1.st.v1
>>>>>>    type count(*)
>>>>>> 1   NA       86
>>>>>> 2    1   234878
>>>>>> 3    2       21
>>>>>> 4    4       45
>>>>>> 5    6     1324
>>>>>> 6    7     5222
>>>>>>
>>>>>> $pd.mogene.2.0.st
>>>>>>    type count(*)
>>>>>> 1   NA      810
>>>>>> 2    1   263551
>>>>>> 3    2       18
>>>>>> 4    4       23
>>>>>> 5    7     5331
>>>>>> 6    9       18
>>>>>>
>>>>>> $pd.mogene.2.1.st
>>>>>>    type count(*)
>>>>>> 1   NA      810
>>>>>> 2    1   263551
>>>>>> 3    2       18
>>>>>> 4    4       23
>>>>>> 5    7     5331
>>>>>> 6    9       18
>>>>>>
>>>>>> $pd.ragene.1.0.st.v1
>>>>>>    type count(*)
>>>>>> 1   NA      254
>>>>>> 2    1   211195
>>>>>> 3    2       21
>>>>>> 4    4       45
>>>>>> 5    6      399
>>>>>> 6    7     1153
>>>>>>
>>>>>> $pd.ragene.1.1.st.v1
>>>>>>    type count(*)
>>>>>> 1   NA      254
>>>>>> 2    1   211195
>>>>>> 3    2       21
>>>>>> 4    4       45
>>>>>> 5    6      399
>>>>>> 6    7     1153
>>>>>>
>>>>>> $pd.ragene.2.0.st
>>>>>>    type count(*)
>>>>>> 1   NA     1071
>>>>>> 2    1   214018
>>>>>> 3    2       18
>>>>>> 4    4       23
>>>>>> 5    7     5083
>>>>>> 6    9       18
>>>>>>
>>>>>> $pd.ragene.2.1.st
>>>>>>    type count(*)
>>>>>> 1   NA     1071
>>>>>> 2    1   214018
>>>>>> 3    2       18
>>>>>> 4    4       23
>>>>>> 5    7     5083
>>>>>> 6    9       18
>>>>>>
>>>>>>> sapply(b,length)
>>>>>> pd.hugene.1.0.st.v1 pd.hugene.1.1.st.v1    pd.hugene.2.0.st
>>>>>> pd.hugene.2.1.st
>>>>>>                 1195                1195
>>>>>> 0                   0
>>>>>> pd.mogene.1.0.st.v1 pd.mogene.1.1.st.v1    pd.mogene.2.0.st
>>>>>> pd.mogene.2.1.st
>>>>>>                 1324                1324
>>>>>> 0                   0
>>>>>> pd.ragene.1.0.st.v1 pd.ragene.1.1.st.v1    pd.ragene.2.0.st
>>>>>> pd.ragene.2.1.st
>>>>>>                  399                 399
>>>>>> 0                   0
>>>>>>
>>>>>>> sessionInfo()
>>>>>> R version 3.0.0 (2013-04-03)
>>>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>>>
>>>>>> locale:
>>>>>>   [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C
>>>>>>   [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8
>>>>>>   [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8
>>>>>>   [7] LC_PAPER=C                 LC_NAME=C
>>>>>>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>>>> [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
>>>>>>
>>>>>> attached base packages:
>>>>>> [1] parallel  stats     graphics  grDevices utils     datasets
>>>>>> methods
>>>>>> [8] base
>>>>>>
>>>>>> other attached packages:
>>>>>>   [1] pd.ragene.2.1.st_2.12.1   pd.ragene.2.0.st_2.12.0
>>>>>>   [3] pd.ragene.1.1.st.v1_3.8.0 pd.ragene.1.0.st.v1_3.8.0
>>>>>>   [5] pd.mogene.2.1.st_2.12.1   pd.mogene.2.0.st_2.12.0
>>>>>>   [7] pd.mogene.1.1.st.v1_3.8.0 pd.mogene.1.0.st.v1_3.8.0
>>>>>>   [9] pd.hugene.2.1.st_3.8.0    pd.hugene.1.1.st.v1_3.8.0
>>>>>> [11] pd.hugene.1.0.st.v1_3.8.0 pd.hugene.2.0.st_3.8.0
>>>>>> [13] oligo_1.24.0              Biobase_2.20.0
>>>>>> [15] oligoClasses_1.22.0       BiocGenerics_0.6.0
>>>>>> [17] RSQLite_0.11.4            DBI_0.2-7
>>>>>> [19] BiocInstaller_1.10.2
>>>>>>
>>>>>> loaded via a namespace (and not attached):
>>>>>>   [1] affxparser_1.32.1     affyio_1.28.0         Biostrings_2.28.0
>>>>>>   [4] bit_1.1-10            codetools_0.2-8       ff_2.2-11
>>>>>>   [7] foreach_1.4.1         GenomicRanges_1.12.3  IRanges_1.18.1
>>>>>> [10] iterators_1.0.6       preprocessCore_1.22.0 splines_3.0.0
>>>>>> [13] stats4_3.0.0          tools_3.0.0           zlibbioc_1.6.0
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>     [[alternative HTML version deleted]]
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioconductor mailing list
>>>>>> Bioconductor at r-project.org
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>> Search the archives:
>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>
>>>
>



More information about the Bioconductor mailing list