[BioC] pd.hugene.2.0.st missing normgene->exon mappings
James W. MacDonald
jmacdon at uw.edu
Wed Jul 10 16:42:55 CEST 2013
Hi Mark,
I stand corrected. These probesets ARE being summarized at the
transcript level. So the issue is how to mark a probeset as both 'main'
and 'normgene->exon' at the same time in the pd package.
Best,
Jim
On 7/10/2013 9:56 AM, James W. MacDonald wrote:
> Hi Mark,
>
> The issue here is that the fsetid AND the meta_fsetid for probeset
> 16934607 are the same. In other words, when you summarize at the
> probeset level, there is a probeset 16934607 that is intended to
> interrogate an exon of EIF3D. And when you summarize at the transcript
> level, there is a probeset 16934607 that is intended to be a positive
> control.
>
> So at one level, this probeset IS a main probeset, and that is
> reflected in the pgf file. Prior to this series of arrays, the
> positive controls were never also a part of an existing transcript, so
> this issue never came up. As you will note in my earlier response to
> Christian, I agree that this isn't ideal, and that it would be better
> to not tag a positive control as 'main' in the pd package.
>
> However, I don't know how this would be fixed. Note that
>
> > dbGetQuery(con, "select * from core_mps where meta_fsetid='16934607';")
> meta_fsetid transcript_cluster_id fsetid
> 1 16934607 16934607 16934607
>
> and
>
> > dbGetQuery(con, "select * from featureSet where
> transcript_cluster_id='16934583';")[,c(1,5,10)]
> fsetid transcript_cluster_id type
> 1 16934584 16934583 1
> 2 16934585 16934583 1
> 3 16934586 16934583 1
> 4 16934587 16934583 1
> 5 16934588 16934583 1
> 6 16934589 16934583 1
> 7 16934590 16934583 1
> 8 16934591 16934583 1
> 9 16934592 16934583 1
> 10 16934593 16934583 1
> 11 16934594 16934583 1
> 12 16934595 16934583 1
> 13 16934596 16934583 1
> 14 16934597 16934583 1
> 15 16934598 16934583 1
> 16 16934600 16934583 1
> 17 16934601 16934583 1
> 18 16934602 16934583 1
> 19 16934603 16934583 1
> 20 16934604 16934583 1
> 21 16934607 16934583 1
> 22 16934608 16934583 1
> 23 16934609 16934583 1
> 24 16934610 16934583 1
>
> and
>
> > dbGetQuery(con, "select * from featureSet where
> transcript_cluster_id='16934607';")
> [1] fsetid strand start
> [4] stop transcript_cluster_id exon_id
> [7] crosshyb_type level chrom
> [10] type
>
> So this seems like what you would want to happen. There isn't a
> probeset 16934607 when you summarize at the transcript level
>
>
> On 7/10/2013 6:31 AM, Mark Cowley wrote:
>> Hi,
>> I see the point you're making Christian.
>> I'm offline now, so cant check this, but i assume that EIF3D and the pos_control meta-probeset in question have different transcript cluster id's. it doesn't make sense for the pos_control transcript cluster Id to be tagged as 'main'. My grep -f was still running when i left work to confirm whether all normgene->exon probesets are all deployed within different real genes in the probeset csv file.
>>
>> Cheers and thanks for looking into this
>>
>> Mark
>>
>> Sent from my iPhone
>>
>> On 10/07/2013, at 7:53 AM, "cstrato"<cstrato at aon.at> wrote:
>>
>>> Dear Jim,
>>>
>>> As far as I understand it, at the transcript level 16934607 is on one hand part of the EIF3D transcript and on the other hand does serve as a positive control. To me this seems to be no contradiction, but probably only DevNet can explain.
>>>
>>> Best regards,
>>> Christian
>>>
>>>
>>> On 7/9/13 11:46 PM, James W. MacDonald wrote:
>>>> Hi Christian,
>>>>
>>>> Thanks for pointing that out. There is still a bit of an inconsistency
>>>> with the pd packages that should probably be corrected, as at the
>>>> probeset level e.g., 16934607 is intended to measure an exon of EIF3D,
>>>> whereas at the transcript level, this same probeset is intended to be a
>>>> positive control (and as you note below, these probes are incorporated
>>>> into a larger probeset intended to measure the EIF3D transcript).
>>>>
>>>> It would be nice to be able to filter out the controls like Mark
>>>> attempted (and I do regularly as well).
>>>>
>>>> Mark - I talked with Benilton Carvalho about this, and he will take a
>>>> look next week.
>>>>
>>>> Best,
>>>>
>>>> Jim
>>>>
>>>>
>>>>
>>>> On 7/9/2013 3:38 PM, cstrato wrote:
>>>>> Dear Jim,
>>>>>
>>>>> In xps I use as basic file for exon arrays the probeset annotation
>>>>> file and then compare the data to the data from the pgf-file. Any
>>>>> differences will be reported.
>>>>>
>>>>> I have just checked the different files for HuGene-2_0-st. If you
>>>>> check as an example the following probeset_ids:
>>>>> 16934607
>>>>> 16934608
>>>>> 16934609
>>>>> 16934610
>>>>>
>>>>> Then you will see that the transcript annotation file lists these ids
>>>>> as 'normgene->exon' and 'pos_control'. However, the probeset
>>>>> annotation file lists these ids as 'main' belonging to gene EIF3D with
>>>>> transcript_cluster_id 16934583. Looking for this id in the transcript
>>>>> annotation file reveals that the number of 'total_probes' is 24.
>>>>> Indeed, the probeset annotation file lists 24 probesets including the
>>>>> four above mentioned probeset_ids.
>>>>>
>>>>> This means that although these four probesets are listed in the
>>>>> transcript annotation file as 'normgene->exon' the label 'main' in the
>>>>> pgf-file is correct since these probesets are part of the gene EIF3D.
>>>>>
>>>>> Interestingly, the pgf-file for HuGene-1_0-st has extra probesets
>>>>> listed as 'normgene->exon'. However, in this case these probesets are
>>>>> also listed as 'normgene->exon' in the probeset annotation file, i.e.
>>>>> these probesets do not belong to any transcript listed in the probeset
>>>>> annotation file.
>>>>>
>>>>> Best regards,
>>>>> Christian
>>>>>
>>>>>
>>>>> On 7/9/13 8:46 PM, James W. MacDonald wrote:
>>>>>> Hi Christian,
>>>>>>
>>>>>> That's not the issue. Instead, the issue is that the pgf file lists the
>>>>>> normgene->exon probeset IDs as being 'main'. I have received a response
>>>>>> from Affy stating that the qcc file lists the normgene->exon probesets
>>>>>> as pos_control, but that seems orthogonal to the issue at hand.
>>>>>>
>>>>>>> qcc<- read.table("HuGene-2_0-st.qcc", comment.char="#",
>>>>>> stringsAsFactors=F, header=T)
>>>>>>> pgf<- readPgf("HuGene-2_0-st.pgf")
>>>>>>> head(qcc)
>>>>>> probeset_id group_name probeset_name quantification_in_header
>>>>>> 1 16650001 neg_control 16650001 0
>>>>>> 2 16650003 neg_control 16650003 0
>>>>>>
>>>>>> ## get the positive controls (normgene->exon probesets) from the qcc
>>>>>> file
>>>>>>> pos_cont<- qcc[qcc[,2] == "pos_control",1]
>>>>>> ## compare to pgf file
>>>>>>> x<- pgf$probesetType[pgf$probesetId %in% pos_cont]
>>>>>>> table(x)
>>>>>> x
>>>>>> main
>>>>>> 1626
>>>>>>
>>>>>> So in the pgf file, these probesets are being called 'main' instead of
>>>>>> some sort of control. How do you handle this in xps? Do you use the pgf
>>>>>> file?
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Jim
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 7/9/2013 2:06 PM, cstrato wrote:
>>>>>>> Dear Jim,
>>>>>>>
>>>>>>> I did not really follow the discussion so I may be wrong, but if you
>>>>>>> mean that there is a difference between the number of 'main' types,
>>>>>>> please note that number of 'main' for pgf, i.e 349012, corresponds to
>>>>>>> the number of 'main' in the probeset annotation file and not in the
>>>>>>> transcript annotation file.
>>>>>>>
>>>>>>> But as I said, I may have misunderstood the problem.
>>>>>>>
>>>>>>> I am mainly replying because at the beginning of this year I had long
>>>>>>> discussions with DevNet to make sure that the annotation files for the
>>>>>>> 2.X arrays are correct, and in version na33.2 DevNet did correct
>>>>>>> everything what I have found.
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Christian
>>>>>>>
>>>>>>>
>>>>>>> On 7/9/13 7:13 PM, James W. MacDonald wrote:
>>>>>>>> Hi Mark,
>>>>>>>>
>>>>>>>> Thanks for the heads-up. We already knew that Affy messed up the
>>>>>>>> transcript and probeset annotation files (and had them fixed), but
>>>>>>>> didn't think I needed to check the others. Famous last words, no?
>>>>>>>>
>>>>>>>>> x<- readPgf("HuGene-2_0-st.pgf")
>>>>>>>>> table(x$probesetType)
>>>>>>>> control->affx control->affx->bac_spike
>>>>>>>> 18 18
>>>>>>>> control->affx->ercc_spike control->affx->polya_spike
>>>>>>>> 92 39
>>>>>>>> control->bgp->antigenomic main
>>>>>>>> 23 349012
>>>>>>>> normgene->intron reporter
>>>>>>>> 3575 82
>>>>>>>>
>>>>>>>>> y<- read.csv("HuGene-2_0-st-v1.na33.2.hg19.transcript.csv",
>>>>>>>> comment.char = "#", stringsAsFactors=FALSE, header = TRUE)
>>>>>>>>> table(y$category)
>>>>>>>> control->affx control->affx->bac_spike
>>>>>>>> 18 18
>>>>>>>> control->affx->ercc-spike control->affx->polya_spike
>>>>>>>> 92 39
>>>>>>>> control->bgp->antigenomic main
>>>>>>>> 23 44629
>>>>>>>> normgene->exon normgene->intron
>>>>>>>> 1626 3575
>>>>>>>> reporter rescue
>>>>>>>> 82 3515
>>>>>>>>
>>>>>>>> I'll ping Affymetrix and see what they have to say.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>>
>>>>>>>> Jim
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 7/9/2013 3:29 AM, Mark Cowley wrote:
>>>>>>>>> Dear Benilton, James& Bioconductors,
>>>>>>>>> Thanks for providing the platform design packages for
>>>>>>>>> hugene/mogene/ragene 1.0/1.1/2.0/2.1 arrays.
>>>>>>>>>
>>>>>>>>> I use SQL to query these packages& ultimately retain only 'main'
>>>>>>>>> probes in my analysis. This works well for 1.0 and 1.1 packages, but
>>>>>>>>> nor for 2.0 and 2.1 ST arrays. For 2.0 and 2.1 arrays, the
>>>>>>>>> normgene->exon control probes are misclassified as 'main' probes.
>>>>>>>>>
>>>>>>>>> evidence: the HuGene-2_0-st-v1.na33.2.hg19.transcript.csvNetAffx csv
>>>>>>>>> files lists 1626 normgene->exon probes, however the pg.hugene.2.0.st
>>>>>>>>> package lists 0, and assigns these 1626 probes to the 'main'
>>>>>>>>> category:
>>>>>>>>>
>>>>>>>>> # probe types:
>>>>>>>>> library(pd.hugene.2.0.st)
>>>>>>>>> conn<- db(pd.hugene.2.0.st)
>>>>>>>>> dbGetQuery(conn,"SELECT * from type_dict")
>>>>>>>>> type type_id
>>>>>>>>> 1 1 main
>>>>>>>>> 2 2 control->affx
>>>>>>>>> 3 3 control->chip
>>>>>>>>> 4 4 control->bgp->antigenomic
>>>>>>>>> 5 5 control->bgp->genomic
>>>>>>>>> 6 6 normgene->exon
>>>>>>>>> 7 7 normgene->intron
>>>>>>>>> 8 8 rescue->FLmRNA->unmapped
>>>>>>>>> 9 9 control->affx->bac_spike
>>>>>>>>> 10 10 oligo_spike_in
>>>>>>>>> 11 11 r1_bac_spike_at
>>>>>>>>>
>>>>>>>>> # probe counts for each of the probe categories:
>>>>>>>>> dbGetQuery(conn,"SELECT type, count(*) from featureSet GROUP BY
>>>>>>>>> type")
>>>>>>>>> type count(*)
>>>>>>>>> 1 NA 3728
>>>>>>>>> 2 1 345497
>>>>>>>>> 3 2 18
>>>>>>>>> 4 4 23
>>>>>>>>> 5 7 3575
>>>>>>>>> 6 9 18
>>>>>>>>>
>>>>>>>>> NB: no type 6 probes.
>>>>>>>>> I've tested all 12 ho/mo/ra gene 1.0,1.1,2.0,2.1 ST packages, and see
>>>>>>>>> this issue for all 2.0 and 2.1 arrays (see below)
>>>>>>>>>
>>>>>>>>> Can these mappings please be updated?
>>>>>>>>>
>>>>>>>>> PS, there's a bunch of probes with type = NA in the database. I
>>>>>>>>> haven't investigated these in any detail.
>>>>>>>>>
>>>>>>>>> cheers,
>>>>>>>>> Mark
>>>>>>>>> -----------------------------------------------------
>>>>>>>>> Mark Cowley, PhD
>>>>>>>>>
>>>>>>>>> Genome Informatics Division& the Centre for Clinical Genomics
>>>>>>>>> The Kinghorn Cancer Centre, Garvan Institute of Medical Research,
>>>>>>>>> Sydney, Australia
>>>>>>>>> -----------------------------------------------------
>>>>>>>>>
>>>>>>>>> All 12 packages below:
>>>>>>>>> pd.packages<- c(
>>>>>>>>> "pd.hugene.1.0.st.v1", "pd.hugene.1.1.st.v1", "pd.hugene.2.0.st",
>>>>>>>>> "pd.hugene.2.1.st",
>>>>>>>>> "pd.mogene.1.0.st.v1", "pd.mogene.1.1.st.v1", "pd.mogene.2.0.st",
>>>>>>>>> "pd.mogene.2.1.st",
>>>>>>>>> "pd.ragene.1.0.st.v1", "pd.ragene.1.1.st.v1", "pd.ragene.2.0.st",
>>>>>>>>> "pd.ragene.2.1.st"
>>>>>>>>> )
>>>>>>>>>
>>>>>>>>> a<- b<- list()
>>>>>>>>> for(pd.pkg.name in pd.packages) {
>>>>>>>>> try({
>>>>>>>>> require(pd.pkg.name, character.only=TRUE) || stop("Can't load
>>>>>>>>> the
>>>>>>>>> pd.package")
>>>>>>>>> conn<- db(get(pd.pkg.name))
>>>>>>>>> a[[pd.pkg.name]]<- dbGetQuery(conn,"SELECT type, count(*) from
>>>>>>>>> featureSet GROUP BY type")
>>>>>>>>> b[[pd.pkg.name]]<- dbGetQuery(conn,"SELECT fsetid from
>>>>>>>>> featureSet
>>>>>>>>> WHERE type = 6")[,1]
>>>>>>>>> })
>>>>>>>>> }
>>>>>>>>> dbGetQuery(conn,"SELECT * from type_dict")
>>>>>>>>>
>>>>>>>>>> a
>>>>>>>>> $pd.hugene.1.0.st.v1
>>>>>>>>> type count(*)
>>>>>>>>> 1 NA 227
>>>>>>>>> 2 1 253002
>>>>>>>>> 3 2 57
>>>>>>>>> 4 4 45
>>>>>>>>> 5 6 1195
>>>>>>>>> 6 7 2904
>>>>>>>>>
>>>>>>>>> $pd.hugene.1.1.st.v1
>>>>>>>>> type count(*)
>>>>>>>>> 1 NA 227
>>>>>>>>> 2 1 253002
>>>>>>>>> 3 2 57
>>>>>>>>> 4 4 45
>>>>>>>>> 5 6 1195
>>>>>>>>> 6 7 2904
>>>>>>>>>
>>>>>>>>> $pd.hugene.2.0.st
>>>>>>>>> type count(*)
>>>>>>>>> 1 NA 3728
>>>>>>>>> 2 1 345497
>>>>>>>>> 3 2 18
>>>>>>>>> 4 4 23
>>>>>>>>> 5 7 3575
>>>>>>>>> 6 9 18
>>>>>>>>>
>>>>>>>>> $pd.hugene.2.1.st
>>>>>>>>> type count(*)
>>>>>>>>> 1 NA 3728
>>>>>>>>> 2 1 345497
>>>>>>>>> 3 2 18
>>>>>>>>> 4 4 23
>>>>>>>>> 5 7 3575
>>>>>>>>> 6 9 18
>>>>>>>>>
>>>>>>>>> $pd.mogene.1.0.st.v1
>>>>>>>>> type count(*)
>>>>>>>>> 1 NA 86
>>>>>>>>> 2 1 234878
>>>>>>>>> 3 2 21
>>>>>>>>> 4 4 45
>>>>>>>>> 5 6 1324
>>>>>>>>> 6 7 5222
>>>>>>>>>
>>>>>>>>> $pd.mogene.1.1.st.v1
>>>>>>>>> type count(*)
>>>>>>>>> 1 NA 86
>>>>>>>>> 2 1 234878
>>>>>>>>> 3 2 21
>>>>>>>>> 4 4 45
>>>>>>>>> 5 6 1324
>>>>>>>>> 6 7 5222
>>>>>>>>>
>>>>>>>>> $pd.mogene.2.0.st
>>>>>>>>> type count(*)
>>>>>>>>> 1 NA 810
>>>>>>>>> 2 1 263551
>>>>>>>>> 3 2 18
>>>>>>>>> 4 4 23
>>>>>>>>> 5 7 5331
>>>>>>>>> 6 9 18
>>>>>>>>>
>>>>>>>>> $pd.mogene.2.1.st
>>>>>>>>> type count(*)
>>>>>>>>> 1 NA 810
>>>>>>>>> 2 1 263551
>>>>>>>>> 3 2 18
>>>>>>>>> 4 4 23
>>>>>>>>> 5 7 5331
>>>>>>>>> 6 9 18
>>>>>>>>>
>>>>>>>>> $pd.ragene.1.0.st.v1
>>>>>>>>> type count(*)
>>>>>>>>> 1 NA 254
>>>>>>>>> 2 1 211195
>>>>>>>>> 3 2 21
>>>>>>>>> 4 4 45
>>>>>>>>> 5 6 399
>>>>>>>>> 6 7 1153
>>>>>>>>>
>>>>>>>>> $pd.ragene.1.1.st.v1
>>>>>>>>> type count(*)
>>>>>>>>> 1 NA 254
>>>>>>>>> 2 1 211195
>>>>>>>>> 3 2 21
>>>>>>>>> 4 4 45
>>>>>>>>> 5 6 399
>>>>>>>>> 6 7 1153
>>>>>>>>>
>>>>>>>>> $pd.ragene.2.0.st
>>>>>>>>> type count(*)
>>>>>>>>> 1 NA 1071
>>>>>>>>> 2 1 214018
>>>>>>>>> 3 2 18
>>>>>>>>> 4 4 23
>>>>>>>>> 5 7 5083
>>>>>>>>> 6 9 18
>>>>>>>>>
>>>>>>>>> $pd.ragene.2.1.st
>>>>>>>>> type count(*)
>>>>>>>>> 1 NA 1071
>>>>>>>>> 2 1 214018
>>>>>>>>> 3 2 18
>>>>>>>>> 4 4 23
>>>>>>>>> 5 7 5083
>>>>>>>>> 6 9 18
>>>>>>>>>
>>>>>>>>>> sapply(b,length)
>>>>>>>>> pd.hugene.1.0.st.v1 pd.hugene.1.1.st.v1 pd.hugene.2.0.st
>>>>>>>>> pd.hugene.2.1.st
>>>>>>>>> 1195 1195
>>>>>>>>> 0 0
>>>>>>>>> pd.mogene.1.0.st.v1 pd.mogene.1.1.st.v1 pd.mogene.2.0.st
>>>>>>>>> pd.mogene.2.1.st
>>>>>>>>> 1324 1324
>>>>>>>>> 0 0
>>>>>>>>> pd.ragene.1.0.st.v1 pd.ragene.1.1.st.v1 pd.ragene.2.0.st
>>>>>>>>> pd.ragene.2.1.st
>>>>>>>>> 399 399
>>>>>>>>> 0 0
>>>>>>>>>
>>>>>>>>>> sessionInfo()
>>>>>>>>> R version 3.0.0 (2013-04-03)
>>>>>>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>>>>>>
>>>>>>>>> locale:
>>>>>>>>> [1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C
>>>>>>>>> [3] LC_TIME=en_AU.UTF-8 LC_COLLATE=en_AU.UTF-8
>>>>>>>>> [5] LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8
>>>>>>>>> [7] LC_PAPER=C LC_NAME=C
>>>>>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>>>>>>>>> [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
>>>>>>>>>
>>>>>>>>> attached base packages:
>>>>>>>>> [1] parallel stats graphics grDevices utils datasets
>>>>>>>>> methods
>>>>>>>>> [8] base
>>>>>>>>>
>>>>>>>>> other attached packages:
>>>>>>>>> [1] pd.ragene.2.1.st_2.12.1 pd.ragene.2.0.st_2.12.0
>>>>>>>>> [3] pd.ragene.1.1.st.v1_3.8.0 pd.ragene.1.0.st.v1_3.8.0
>>>>>>>>> [5] pd.mogene.2.1.st_2.12.1 pd.mogene.2.0.st_2.12.0
>>>>>>>>> [7] pd.mogene.1.1.st.v1_3.8.0 pd.mogene.1.0.st.v1_3.8.0
>>>>>>>>> [9] pd.hugene.2.1.st_3.8.0 pd.hugene.1.1.st.v1_3.8.0
>>>>>>>>> [11] pd.hugene.1.0.st.v1_3.8.0 pd.hugene.2.0.st_3.8.0
>>>>>>>>> [13] oligo_1.24.0 Biobase_2.20.0
>>>>>>>>> [15] oligoClasses_1.22.0 BiocGenerics_0.6.0
>>>>>>>>> [17] RSQLite_0.11.4 DBI_0.2-7
>>>>>>>>> [19] BiocInstaller_1.10.2
>>>>>>>>>
>>>>>>>>> loaded via a namespace (and not attached):
>>>>>>>>> [1] affxparser_1.32.1 affyio_1.28.0 Biostrings_2.28.0
>>>>>>>>> [4] bit_1.1-10 codetools_0.2-8 ff_2.2-11
>>>>>>>>> [7] foreach_1.4.1 GenomicRanges_1.12.3 IRanges_1.18.1
>>>>>>>>> [10] iterators_1.0.6 preprocessCore_1.22.0 splines_3.0.0
>>>>>>>>> [13] stats4_3.0.0 tools_3.0.0 zlibbioc_1.6.0
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> [[alternative HTML version deleted]]
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> Bioconductor mailing list
>>>>>>>>> Bioconductor at r-project.org
>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>>>> Search the archives:
>>>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> --
> James W. MacDonald, M.S.
> Biostatistician
> University of Washington
> Environmental and Occupational Health Sciences
> 4225 Roosevelt Way NE, # 100
> Seattle WA 98105-6099
--
James W. MacDonald, M.S.
Biostatistician
University of Washington
Environmental and Occupational Health Sciences
4225 Roosevelt Way NE, # 100
Seattle WA 98105-6099
More information about the Bioconductor
mailing list