[BioC] pd.hugene.2.0.st missing normgene->exon mappings

Wed Jul 10 16:04:17 CEST 2013

Hi Mark,

The issue here is that the fsetid AND the meta_fsetid for probeset 
16934607 are the same. In other words, when you summarize at the 
probeset level, there is a probeset 16934607 that is intended to 
interrogate an exon of EIF3D. And when you summarize at the transcript 
level, there is a probeset 16934607 that is intended to be a positive 
control.

So at one level, this probeset IS a main probeset, and that is reflected 
in the pgf file. Prior to this series of arrays, the positive controls 
were never also a part of an existing transcript, so this issue never 
came up. As you will note in my earlier response to Christian, I agree 
that this isn't ideal, and that it would be better to not tag a positive 
control as 'main' in the pd package.

However, I don't know how this would be fixed. Note that

 > dbGetQuery(con, "select * from core_mps where meta_fsetid='16934607';")
   meta_fsetid transcript_cluster_id   fsetid
1    16934607              16934607 16934607

and

 > dbGetQuery(con, "select * from featureSet where 
transcript_cluster_id='16934583';")[,c(1,5,10)]
      fsetid transcript_cluster_id type
1  16934584              16934583    1
2  16934585              16934583    1
3  16934586              16934583    1
4  16934587              16934583    1
5  16934588              16934583    1
6  16934589              16934583    1
7  16934590              16934583    1
8  16934591              16934583    1
9  16934592              16934583    1
10 16934593              16934583    1
11 16934594              16934583    1
12 16934595              16934583    1
13 16934596              16934583    1
14 16934597              16934583    1
15 16934598              16934583    1
16 16934600              16934583    1
17 16934601              16934583    1
18 16934602              16934583    1
19 16934603              16934583    1
20 16934604              16934583    1
21 16934607              16934583    1
22 16934608              16934583    1
23 16934609              16934583    1
24 16934610              16934583    1

and

 > dbGetQuery(con, "select * from featureSet where 
transcript_cluster_id='16934607';")
  [1] fsetid                strand                start
  [4] stop                  transcript_cluster_id exon_id
  [7] crosshyb_type         level                 chrom
[10] type

So this seems like what you would want to happen. The 16934607 probeset 
isn't summarized at the transcript level, but is incorporated into the 
16934583 probeset, in which case the 'main' designation is correct.

Best,

Jim

On 7/10/2013 6:31 AM, Mark Cowley wrote:
> Hi,
> I see the point you're making Christian.
> I'm offline now, so cant check this, but i assume that EIF3D and the pos_control meta-probeset in question have different transcript cluster id's. it doesn't make sense for the pos_control transcript cluster Id to be tagged as 'main'. My grep -f was still running when i left work to confirm whether all normgene->exon probesets are all deployed within different real genes in the probeset csv file.
>
> Cheers and thanks for looking into this
>
> Mark
>
> Sent from my iPhone
>
> On 10/07/2013, at 7:53 AM, "cstrato"<cstrato at aon.at>  wrote:
>
>> Dear Jim,
>>
>> As far as I understand it, at the transcript level 16934607 is on one hand part of the EIF3D transcript and on the other hand does serve as a positive control. To me this seems to be no contradiction, but probably only DevNet can explain.
>>
>> Best regards,
>> Christian
>>
>>
>> On 7/9/13 11:46 PM, James W. MacDonald wrote:
>>> Hi Christian,
>>>
>>> Thanks for pointing that out. There is still a bit of an inconsistency
>>> with the pd packages that should probably be corrected, as at the
>>> probeset level e.g., 16934607 is intended to measure an exon of EIF3D,
>>> whereas at the transcript level, this same probeset is intended to be a
>>> positive control (and as you note below, these probes are incorporated
>>> into a larger probeset intended to measure the EIF3D transcript).
>>>
>>> It would be nice to be able to filter out the controls like Mark
>>> attempted (and I do regularly as well).
>>>
>>> Mark - I talked with Benilton Carvalho about this, and he will take a
>>> look next week.
>>>
>>> Best,
>>>
>>> Jim
>>>
>>>
>>>
>>> On 7/9/2013 3:38 PM, cstrato wrote:
>>>> Dear Jim,
>>>>
>>>> In xps I use as basic file for exon arrays the probeset annotation
>>>> file and then compare the data to the data from the pgf-file. Any
>>>> differences will be reported.
>>>>
>>>> I have just checked the different files for HuGene-2_0-st. If you
>>>> check as an example the following probeset_ids:
>>>> 16934607
>>>> 16934608
>>>> 16934609
>>>> 16934610
>>>>
>>>> Then you will see that the transcript annotation file lists these ids
>>>> as 'normgene->exon' and 'pos_control'. However, the probeset
>>>> annotation file lists these ids as 'main' belonging to gene EIF3D with
>>>> transcript_cluster_id 16934583. Looking for this id in the transcript
>>>> annotation file reveals that the number of 'total_probes' is 24.
>>>> Indeed, the probeset annotation file lists 24 probesets including the
>>>> four above mentioned probeset_ids.
>>>>
>>>> This means that although these four probesets are listed in the
>>>> transcript annotation file as 'normgene->exon' the label 'main' in the
>>>> pgf-file is correct since these probesets are part of the gene EIF3D.
>>>>
>>>> Interestingly, the pgf-file for HuGene-1_0-st has extra probesets
>>>> listed as 'normgene->exon'. However, in this case these probesets are
>>>> also listed as 'normgene->exon' in the probeset annotation file, i.e.
>>>> these probesets do not belong to any transcript listed in the probeset
>>>> annotation file.
>>>>
>>>> Best regards,
>>>> Christian
>>>>
>>>>
>>>> On 7/9/13 8:46 PM, James W. MacDonald wrote:
>>>>> Hi Christian,
>>>>>
>>>>> That's not the issue. Instead, the issue is that the pgf file lists the
>>>>> normgene->exon probeset IDs as being 'main'. I have received a response
>>>>> from Affy stating that the qcc file lists the normgene->exon probesets
>>>>> as pos_control, but that seems orthogonal to the issue at hand.
>>>>>
>>>>>> qcc<- read.table("HuGene-2_0-st.qcc", comment.char="#",
>>>>> stringsAsFactors=F, header=T)
>>>>>> pgf<- readPgf("HuGene-2_0-st.pgf")
>>>>>> head(qcc)
>>>>>    probeset_id  group_name probeset_name quantification_in_header
>>>>> 1    16650001 neg_control      16650001                        0
>>>>> 2    16650003 neg_control      16650003                        0
>>>>>
>>>>> ## get the positive controls (normgene->exon probesets) from the qcc
>>>>> file
>>>>>> pos_cont<- qcc[qcc[,2] == "pos_control",1]
>>>>> ## compare to pgf file
>>>>>> x<- pgf$probesetType[pgf$probesetId %in% pos_cont]
>>>>>> table(x)
>>>>> x
>>>>> main
>>>>> 1626
>>>>>
>>>>> So in the pgf file, these probesets are being called 'main' instead of
>>>>> some sort of control. How do you handle this in xps? Do you use the pgf
>>>>> file?
>>>>>
>>>>> Best,
>>>>>
>>>>> Jim
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 7/9/2013 2:06 PM, cstrato wrote:
>>>>>> Dear Jim,
>>>>>>
>>>>>> I did not really follow the discussion so I may be wrong, but if you
>>>>>> mean that there is a difference between the number of 'main' types,
>>>>>> please note that number of 'main' for pgf, i.e 349012, corresponds to
>>>>>> the number of 'main' in the probeset annotation file and not in the
>>>>>> transcript annotation file.
>>>>>>
>>>>>> But as I said, I may have misunderstood the problem.
>>>>>>
>>>>>> I am mainly replying because at the beginning of this year I had long
>>>>>> discussions with DevNet to make sure that the annotation files for the
>>>>>> 2.X arrays are correct, and in version na33.2 DevNet did correct
>>>>>> everything what I have found.
>>>>>>
>>>>>> Best regards,
>>>>>> Christian
>>>>>>
>>>>>>
>>>>>> On 7/9/13 7:13 PM, James W. MacDonald wrote:
>>>>>>> Hi Mark,
>>>>>>>
>>>>>>> Thanks for the heads-up. We already knew that Affy messed up the
>>>>>>> transcript and probeset annotation files (and had them fixed), but
>>>>>>> didn't think I needed to check the others. Famous last words, no?
>>>>>>>
>>>>>>>> x<- readPgf("HuGene-2_0-st.pgf")
>>>>>>>> table(x$probesetType)
>>>>>>>               control->affx   control->affx->bac_spike
>>>>>>>                          18                         18
>>>>>>>   control->affx->ercc_spike control->affx->polya_spike
>>>>>>>                          92                         39
>>>>>>>   control->bgp->antigenomic                       main
>>>>>>>                          23                     349012
>>>>>>>            normgene->intron                   reporter
>>>>>>>                        3575                         82
>>>>>>>
>>>>>>>> y<- read.csv("HuGene-2_0-st-v1.na33.2.hg19.transcript.csv",
>>>>>>> comment.char = "#", stringsAsFactors=FALSE, header = TRUE)
>>>>>>>> table(y$category)
>>>>>>>               control->affx   control->affx->bac_spike
>>>>>>>                          18                         18
>>>>>>>   control->affx->ercc-spike control->affx->polya_spike
>>>>>>>                          92                         39
>>>>>>>   control->bgp->antigenomic                       main
>>>>>>>                          23                      44629
>>>>>>>              normgene->exon           normgene->intron
>>>>>>>                        1626                       3575
>>>>>>>                    reporter                     rescue
>>>>>>>                          82                       3515
>>>>>>>
>>>>>>> I'll ping Affymetrix and see what they have to say.
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>> Jim
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 7/9/2013 3:29 AM, Mark Cowley wrote:
>>>>>>>> Dear Benilton, James&   Bioconductors,
>>>>>>>> Thanks for providing the platform design packages for
>>>>>>>> hugene/mogene/ragene 1.0/1.1/2.0/2.1 arrays.
>>>>>>>>
>>>>>>>> I use SQL to query these packages&   ultimately retain only 'main'
>>>>>>>> probes in my analysis. This works well for 1.0 and 1.1 packages, but
>>>>>>>> nor for 2.0 and 2.1 ST arrays. For 2.0 and 2.1 arrays, the
>>>>>>>> normgene->exon control probes are misclassified as 'main' probes.
>>>>>>>>
>>>>>>>> evidence: the HuGene-2_0-st-v1.na33.2.hg19.transcript.csvNetAffx csv
>>>>>>>> files lists 1626 normgene->exon probes, however the pg.hugene.2.0.st
>>>>>>>> package lists 0, and assigns these 1626 probes to the 'main'
>>>>>>>> category:
>>>>>>>>
>>>>>>>> # probe types:
>>>>>>>> library(pd.hugene.2.0.st)
>>>>>>>> conn<- db(pd.hugene.2.0.st)
>>>>>>>> dbGetQuery(conn,"SELECT * from type_dict")
>>>>>>>>     type                   type_id
>>>>>>>> 1     1                      main
>>>>>>>> 2     2             control->affx
>>>>>>>> 3     3             control->chip
>>>>>>>> 4     4 control->bgp->antigenomic
>>>>>>>> 5     5     control->bgp->genomic
>>>>>>>> 6     6            normgene->exon
>>>>>>>> 7     7          normgene->intron
>>>>>>>> 8     8  rescue->FLmRNA->unmapped
>>>>>>>> 9     9  control->affx->bac_spike
>>>>>>>> 10   10            oligo_spike_in
>>>>>>>> 11   11           r1_bac_spike_at
>>>>>>>>
>>>>>>>> # probe counts for each of the probe categories:
>>>>>>>> dbGetQuery(conn,"SELECT type, count(*) from featureSet GROUP BY
>>>>>>>> type")
>>>>>>>>    type count(*)
>>>>>>>> 1   NA     3728
>>>>>>>> 2    1   345497
>>>>>>>> 3    2       18
>>>>>>>> 4    4       23
>>>>>>>> 5    7     3575
>>>>>>>> 6    9       18
>>>>>>>>
>>>>>>>> NB: no type 6 probes.
>>>>>>>> I've tested all 12 ho/mo/ra gene 1.0,1.1,2.0,2.1 ST packages, and see
>>>>>>>> this issue for all 2.0 and 2.1 arrays (see below)
>>>>>>>>
>>>>>>>> Can these mappings please be updated?
>>>>>>>>
>>>>>>>> PS, there's a bunch of probes with type = NA in the database. I
>>>>>>>> haven't investigated these in any detail.
>>>>>>>>
>>>>>>>> cheers,
>>>>>>>> Mark
>>>>>>>> -----------------------------------------------------
>>>>>>>> Mark Cowley, PhD
>>>>>>>>
>>>>>>>> Genome Informatics Division&   the Centre for Clinical Genomics
>>>>>>>> The Kinghorn Cancer Centre, Garvan Institute of Medical Research,
>>>>>>>> Sydney, Australia
>>>>>>>> -----------------------------------------------------
>>>>>>>>
>>>>>>>> All 12 packages below:
>>>>>>>> pd.packages<- c(
>>>>>>>>    "pd.hugene.1.0.st.v1", "pd.hugene.1.1.st.v1", "pd.hugene.2.0.st",
>>>>>>>> "pd.hugene.2.1.st",
>>>>>>>>    "pd.mogene.1.0.st.v1", "pd.mogene.1.1.st.v1", "pd.mogene.2.0.st",
>>>>>>>> "pd.mogene.2.1.st",
>>>>>>>>    "pd.ragene.1.0.st.v1", "pd.ragene.1.1.st.v1", "pd.ragene.2.0.st",
>>>>>>>> "pd.ragene.2.1.st"
>>>>>>>> )
>>>>>>>>
>>>>>>>> a<- b<- list()
>>>>>>>> for(pd.pkg.name in pd.packages) {
>>>>>>>>    try({
>>>>>>>>      require(pd.pkg.name, character.only=TRUE) || stop("Can't load
>>>>>>>> the
>>>>>>>> pd.package")
>>>>>>>>      conn<- db(get(pd.pkg.name))
>>>>>>>>      a[[pd.pkg.name]]<- dbGetQuery(conn,"SELECT type, count(*) from
>>>>>>>> featureSet GROUP BY type")
>>>>>>>>      b[[pd.pkg.name]]<- dbGetQuery(conn,"SELECT fsetid from
>>>>>>>> featureSet
>>>>>>>> WHERE type = 6")[,1]
>>>>>>>>    })
>>>>>>>> }
>>>>>>>> dbGetQuery(conn,"SELECT * from type_dict")
>>>>>>>>
>>>>>>>>> a
>>>>>>>> $pd.hugene.1.0.st.v1
>>>>>>>>    type count(*)
>>>>>>>> 1   NA      227
>>>>>>>> 2    1   253002
>>>>>>>> 3    2       57
>>>>>>>> 4    4       45
>>>>>>>> 5    6     1195
>>>>>>>> 6    7     2904
>>>>>>>>
>>>>>>>> $pd.hugene.1.1.st.v1
>>>>>>>>    type count(*)
>>>>>>>> 1   NA      227
>>>>>>>> 2    1   253002
>>>>>>>> 3    2       57
>>>>>>>> 4    4       45
>>>>>>>> 5    6     1195
>>>>>>>> 6    7     2904
>>>>>>>>
>>>>>>>> $pd.hugene.2.0.st
>>>>>>>>    type count(*)
>>>>>>>> 1   NA     3728
>>>>>>>> 2    1   345497
>>>>>>>> 3    2       18
>>>>>>>> 4    4       23
>>>>>>>> 5    7     3575
>>>>>>>> 6    9       18
>>>>>>>>
>>>>>>>> $pd.hugene.2.1.st
>>>>>>>>    type count(*)
>>>>>>>> 1   NA     3728
>>>>>>>> 2    1   345497
>>>>>>>> 3    2       18
>>>>>>>> 4    4       23
>>>>>>>> 5    7     3575
>>>>>>>> 6    9       18
>>>>>>>>
>>>>>>>> $pd.mogene.1.0.st.v1
>>>>>>>>    type count(*)
>>>>>>>> 1   NA       86
>>>>>>>> 2    1   234878
>>>>>>>> 3    2       21
>>>>>>>> 4    4       45
>>>>>>>> 5    6     1324
>>>>>>>> 6    7     5222
>>>>>>>>
>>>>>>>> $pd.mogene.1.1.st.v1
>>>>>>>>    type count(*)
>>>>>>>> 1   NA       86
>>>>>>>> 2    1   234878
>>>>>>>> 3    2       21
>>>>>>>> 4    4       45
>>>>>>>> 5    6     1324
>>>>>>>> 6    7     5222
>>>>>>>>
>>>>>>>> $pd.mogene.2.0.st
>>>>>>>>    type count(*)
>>>>>>>> 1   NA      810
>>>>>>>> 2    1   263551
>>>>>>>> 3    2       18
>>>>>>>> 4    4       23
>>>>>>>> 5    7     5331
>>>>>>>> 6    9       18
>>>>>>>>
>>>>>>>> $pd.mogene.2.1.st
>>>>>>>>    type count(*)
>>>>>>>> 1   NA      810
>>>>>>>> 2    1   263551
>>>>>>>> 3    2       18
>>>>>>>> 4    4       23
>>>>>>>> 5    7     5331
>>>>>>>> 6    9       18
>>>>>>>>
>>>>>>>> $pd.ragene.1.0.st.v1
>>>>>>>>    type count(*)
>>>>>>>> 1   NA      254
>>>>>>>> 2    1   211195
>>>>>>>> 3    2       21
>>>>>>>> 4    4       45
>>>>>>>> 5    6      399
>>>>>>>> 6    7     1153
>>>>>>>>
>>>>>>>> $pd.ragene.1.1.st.v1
>>>>>>>>    type count(*)
>>>>>>>> 1   NA      254
>>>>>>>> 2    1   211195
>>>>>>>> 3    2       21
>>>>>>>> 4    4       45
>>>>>>>> 5    6      399
>>>>>>>> 6    7     1153
>>>>>>>>
>>>>>>>> $pd.ragene.2.0.st
>>>>>>>>    type count(*)
>>>>>>>> 1   NA     1071
>>>>>>>> 2    1   214018
>>>>>>>> 3    2       18
>>>>>>>> 4    4       23
>>>>>>>> 5    7     5083
>>>>>>>> 6    9       18
>>>>>>>>
>>>>>>>> $pd.ragene.2.1.st
>>>>>>>>    type count(*)
>>>>>>>> 1   NA     1071
>>>>>>>> 2    1   214018
>>>>>>>> 3    2       18
>>>>>>>> 4    4       23
>>>>>>>> 5    7     5083
>>>>>>>> 6    9       18
>>>>>>>>
>>>>>>>>> sapply(b,length)
>>>>>>>> pd.hugene.1.0.st.v1 pd.hugene.1.1.st.v1    pd.hugene.2.0.st
>>>>>>>> pd.hugene.2.1.st
>>>>>>>>                 1195                1195
>>>>>>>> 0                   0
>>>>>>>> pd.mogene.1.0.st.v1 pd.mogene.1.1.st.v1    pd.mogene.2.0.st
>>>>>>>> pd.mogene.2.1.st
>>>>>>>>                 1324                1324
>>>>>>>> 0                   0
>>>>>>>> pd.ragene.1.0.st.v1 pd.ragene.1.1.st.v1    pd.ragene.2.0.st
>>>>>>>> pd.ragene.2.1.st
>>>>>>>>                  399                 399
>>>>>>>> 0                   0
>>>>>>>>
>>>>>>>>> sessionInfo()
>>>>>>>> R version 3.0.0 (2013-04-03)
>>>>>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>>>>>
>>>>>>>> locale:
>>>>>>>>   [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C
>>>>>>>>   [3] LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8
>>>>>>>>   [5] LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_AU.UTF-8
>>>>>>>>   [7] LC_PAPER=C                 LC_NAME=C
>>>>>>>>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>>>>>>> [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
>>>>>>>>
>>>>>>>> attached base packages:
>>>>>>>> [1] parallel  stats     graphics  grDevices utils     datasets
>>>>>>>> methods
>>>>>>>> [8] base
>>>>>>>>
>>>>>>>> other attached packages:
>>>>>>>>   [1] pd.ragene.2.1.st_2.12.1   pd.ragene.2.0.st_2.12.0
>>>>>>>>   [3] pd.ragene.1.1.st.v1_3.8.0 pd.ragene.1.0.st.v1_3.8.0
>>>>>>>>   [5] pd.mogene.2.1.st_2.12.1   pd.mogene.2.0.st_2.12.0
>>>>>>>>   [7] pd.mogene.1.1.st.v1_3.8.0 pd.mogene.1.0.st.v1_3.8.0
>>>>>>>>   [9] pd.hugene.2.1.st_3.8.0    pd.hugene.1.1.st.v1_3.8.0
>>>>>>>> [11] pd.hugene.1.0.st.v1_3.8.0 pd.hugene.2.0.st_3.8.0
>>>>>>>> [13] oligo_1.24.0              Biobase_2.20.0
>>>>>>>> [15] oligoClasses_1.22.0       BiocGenerics_0.6.0
>>>>>>>> [17] RSQLite_0.11.4            DBI_0.2-7
>>>>>>>> [19] BiocInstaller_1.10.2
>>>>>>>>
>>>>>>>> loaded via a namespace (and not attached):
>>>>>>>>   [1] affxparser_1.32.1     affyio_1.28.0         Biostrings_2.28.0
>>>>>>>>   [4] bit_1.1-10            codetools_0.2-8       ff_2.2-11
>>>>>>>>   [7] foreach_1.4.1         GenomicRanges_1.12.3  IRanges_1.18.1
>>>>>>>> [10] iterators_1.0.6       preprocessCore_1.22.0 splines_3.0.0
>>>>>>>> [13] stats4_3.0.0          tools_3.0.0           zlibbioc_1.6.0
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>     [[alternative HTML version deleted]]
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Bioconductor mailing list
>>>>>>>> Bioconductor at r-project.org
>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>>> Search the archives:
>>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099