[BioC] pd.hugene.2.0.st missing normgene->exon mappings
Mark Cowley
m.cowley at garvan.org.au
Wed Jul 10 12:31:57 CEST 2013
Hi,
I see the point you're making Christian.
I'm offline now, so cant check this, but i assume that EIF3D and the pos_control meta-probeset in question have different transcript cluster id's. it doesn't make sense for the pos_control transcript cluster Id to be tagged as 'main'. My grep -f was still running when i left work to confirm whether all normgene->exon probesets are all deployed within different real genes in the probeset csv file.
Cheers and thanks for looking into this
Mark
Sent from my iPhone
On 10/07/2013, at 7:53 AM, "cstrato" <cstrato at aon.at> wrote:
> Dear Jim,
>
> As far as I understand it, at the transcript level 16934607 is on one hand part of the EIF3D transcript and on the other hand does serve as a positive control. To me this seems to be no contradiction, but probably only DevNet can explain.
>
> Best regards,
> Christian
>
>
> On 7/9/13 11:46 PM, James W. MacDonald wrote:
>> Hi Christian,
>>
>> Thanks for pointing that out. There is still a bit of an inconsistency
>> with the pd packages that should probably be corrected, as at the
>> probeset level e.g., 16934607 is intended to measure an exon of EIF3D,
>> whereas at the transcript level, this same probeset is intended to be a
>> positive control (and as you note below, these probes are incorporated
>> into a larger probeset intended to measure the EIF3D transcript).
>>
>> It would be nice to be able to filter out the controls like Mark
>> attempted (and I do regularly as well).
>>
>> Mark - I talked with Benilton Carvalho about this, and he will take a
>> look next week.
>>
>> Best,
>>
>> Jim
>>
>>
>>
>> On 7/9/2013 3:38 PM, cstrato wrote:
>>> Dear Jim,
>>>
>>> In xps I use as basic file for exon arrays the probeset annotation
>>> file and then compare the data to the data from the pgf-file. Any
>>> differences will be reported.
>>>
>>> I have just checked the different files for HuGene-2_0-st. If you
>>> check as an example the following probeset_ids:
>>> 16934607
>>> 16934608
>>> 16934609
>>> 16934610
>>>
>>> Then you will see that the transcript annotation file lists these ids
>>> as 'normgene->exon' and 'pos_control'. However, the probeset
>>> annotation file lists these ids as 'main' belonging to gene EIF3D with
>>> transcript_cluster_id 16934583. Looking for this id in the transcript
>>> annotation file reveals that the number of 'total_probes' is 24.
>>> Indeed, the probeset annotation file lists 24 probesets including the
>>> four above mentioned probeset_ids.
>>>
>>> This means that although these four probesets are listed in the
>>> transcript annotation file as 'normgene->exon' the label 'main' in the
>>> pgf-file is correct since these probesets are part of the gene EIF3D.
>>>
>>> Interestingly, the pgf-file for HuGene-1_0-st has extra probesets
>>> listed as 'normgene->exon'. However, in this case these probesets are
>>> also listed as 'normgene->exon' in the probeset annotation file, i.e.
>>> these probesets do not belong to any transcript listed in the probeset
>>> annotation file.
>>>
>>> Best regards,
>>> Christian
>>>
>>>
>>> On 7/9/13 8:46 PM, James W. MacDonald wrote:
>>>> Hi Christian,
>>>>
>>>> That's not the issue. Instead, the issue is that the pgf file lists the
>>>> normgene->exon probeset IDs as being 'main'. I have received a response
>>>> from Affy stating that the qcc file lists the normgene->exon probesets
>>>> as pos_control, but that seems orthogonal to the issue at hand.
>>>>
>>>> > qcc <- read.table("HuGene-2_0-st.qcc", comment.char="#",
>>>> stringsAsFactors=F, header=T)
>>>> > pgf <- readPgf("HuGene-2_0-st.pgf")
>>>> > head(qcc)
>>>> probeset_id group_name probeset_name quantification_in_header
>>>> 1 16650001 neg_control 16650001 0
>>>> 2 16650003 neg_control 16650003 0
>>>>
>>>> ## get the positive controls (normgene->exon probesets) from the qcc
>>>> file
>>>> > pos_cont <- qcc[qcc[,2] == "pos_control",1]
>>>>
>>>> ## compare to pgf file
>>>> > x <- pgf$probesetType[pgf$probesetId %in% pos_cont]
>>>> > table(x)
>>>> x
>>>> main
>>>> 1626
>>>>
>>>> So in the pgf file, these probesets are being called 'main' instead of
>>>> some sort of control. How do you handle this in xps? Do you use the pgf
>>>> file?
>>>>
>>>> Best,
>>>>
>>>> Jim
>>>>
>>>>
>>>>
>>>>
>>>> On 7/9/2013 2:06 PM, cstrato wrote:
>>>>> Dear Jim,
>>>>>
>>>>> I did not really follow the discussion so I may be wrong, but if you
>>>>> mean that there is a difference between the number of 'main' types,
>>>>> please note that number of 'main' for pgf, i.e 349012, corresponds to
>>>>> the number of 'main' in the probeset annotation file and not in the
>>>>> transcript annotation file.
>>>>>
>>>>> But as I said, I may have misunderstood the problem.
>>>>>
>>>>> I am mainly replying because at the beginning of this year I had long
>>>>> discussions with DevNet to make sure that the annotation files for the
>>>>> 2.X arrays are correct, and in version na33.2 DevNet did correct
>>>>> everything what I have found.
>>>>>
>>>>> Best regards,
>>>>> Christian
>>>>>
>>>>>
>>>>> On 7/9/13 7:13 PM, James W. MacDonald wrote:
>>>>>> Hi Mark,
>>>>>>
>>>>>> Thanks for the heads-up. We already knew that Affy messed up the
>>>>>> transcript and probeset annotation files (and had them fixed), but
>>>>>> didn't think I needed to check the others. Famous last words, no?
>>>>>>
>>>>>> > x <- readPgf("HuGene-2_0-st.pgf")
>>>>>> > table(x$probesetType)
>>>>>>
>>>>>> control->affx control->affx->bac_spike
>>>>>> 18 18
>>>>>> control->affx->ercc_spike control->affx->polya_spike
>>>>>> 92 39
>>>>>> control->bgp->antigenomic main
>>>>>> 23 349012
>>>>>> normgene->intron reporter
>>>>>> 3575 82
>>>>>>
>>>>>> > y <- read.csv("HuGene-2_0-st-v1.na33.2.hg19.transcript.csv",
>>>>>> comment.char = "#", stringsAsFactors=FALSE, header = TRUE)
>>>>>> > table(y$category)
>>>>>>
>>>>>> control->affx control->affx->bac_spike
>>>>>> 18 18
>>>>>> control->affx->ercc-spike control->affx->polya_spike
>>>>>> 92 39
>>>>>> control->bgp->antigenomic main
>>>>>> 23 44629
>>>>>> normgene->exon normgene->intron
>>>>>> 1626 3575
>>>>>> reporter rescue
>>>>>> 82 3515
>>>>>>
>>>>>> I'll ping Affymetrix and see what they have to say.
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>> Jim
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 7/9/2013 3:29 AM, Mark Cowley wrote:
>>>>>>> Dear Benilton, James& Bioconductors,
>>>>>>> Thanks for providing the platform design packages for
>>>>>>> hugene/mogene/ragene 1.0/1.1/2.0/2.1 arrays.
>>>>>>>
>>>>>>> I use SQL to query these packages& ultimately retain only 'main'
>>>>>>> probes in my analysis. This works well for 1.0 and 1.1 packages, but
>>>>>>> nor for 2.0 and 2.1 ST arrays. For 2.0 and 2.1 arrays, the
>>>>>>> normgene->exon control probes are misclassified as 'main' probes.
>>>>>>>
>>>>>>> evidence: the HuGene-2_0-st-v1.na33.2.hg19.transcript.csvNetAffx csv
>>>>>>> files lists 1626 normgene->exon probes, however the pg.hugene.2.0.st
>>>>>>> package lists 0, and assigns these 1626 probes to the 'main'
>>>>>>> category:
>>>>>>>
>>>>>>> # probe types:
>>>>>>> library(pd.hugene.2.0.st)
>>>>>>> conn<- db(pd.hugene.2.0.st)
>>>>>>> dbGetQuery(conn,"SELECT * from type_dict")
>>>>>>> type type_id
>>>>>>> 1 1 main
>>>>>>> 2 2 control->affx
>>>>>>> 3 3 control->chip
>>>>>>> 4 4 control->bgp->antigenomic
>>>>>>> 5 5 control->bgp->genomic
>>>>>>> 6 6 normgene->exon
>>>>>>> 7 7 normgene->intron
>>>>>>> 8 8 rescue->FLmRNA->unmapped
>>>>>>> 9 9 control->affx->bac_spike
>>>>>>> 10 10 oligo_spike_in
>>>>>>> 11 11 r1_bac_spike_at
>>>>>>>
>>>>>>> # probe counts for each of the probe categories:
>>>>>>> dbGetQuery(conn,"SELECT type, count(*) from featureSet GROUP BY
>>>>>>> type")
>>>>>>> type count(*)
>>>>>>> 1 NA 3728
>>>>>>> 2 1 345497
>>>>>>> 3 2 18
>>>>>>> 4 4 23
>>>>>>> 5 7 3575
>>>>>>> 6 9 18
>>>>>>>
>>>>>>> NB: no type 6 probes.
>>>>>>> I've tested all 12 ho/mo/ra gene 1.0,1.1,2.0,2.1 ST packages, and see
>>>>>>> this issue for all 2.0 and 2.1 arrays (see below)
>>>>>>>
>>>>>>> Can these mappings please be updated?
>>>>>>>
>>>>>>> PS, there's a bunch of probes with type = NA in the database. I
>>>>>>> haven't investigated these in any detail.
>>>>>>>
>>>>>>> cheers,
>>>>>>> Mark
>>>>>>> -----------------------------------------------------
>>>>>>> Mark Cowley, PhD
>>>>>>>
>>>>>>> Genome Informatics Division& the Centre for Clinical Genomics
>>>>>>> The Kinghorn Cancer Centre, Garvan Institute of Medical Research,
>>>>>>> Sydney, Australia
>>>>>>> -----------------------------------------------------
>>>>>>>
>>>>>>> All 12 packages below:
>>>>>>> pd.packages<- c(
>>>>>>> "pd.hugene.1.0.st.v1", "pd.hugene.1.1.st.v1", "pd.hugene.2.0.st",
>>>>>>> "pd.hugene.2.1.st",
>>>>>>> "pd.mogene.1.0.st.v1", "pd.mogene.1.1.st.v1", "pd.mogene.2.0.st",
>>>>>>> "pd.mogene.2.1.st",
>>>>>>> "pd.ragene.1.0.st.v1", "pd.ragene.1.1.st.v1", "pd.ragene.2.0.st",
>>>>>>> "pd.ragene.2.1.st"
>>>>>>> )
>>>>>>>
>>>>>>> a<- b<- list()
>>>>>>> for(pd.pkg.name in pd.packages) {
>>>>>>> try({
>>>>>>> require(pd.pkg.name, character.only=TRUE) || stop("Can't load
>>>>>>> the
>>>>>>> pd.package")
>>>>>>> conn<- db(get(pd.pkg.name))
>>>>>>> a[[pd.pkg.name]]<- dbGetQuery(conn,"SELECT type, count(*) from
>>>>>>> featureSet GROUP BY type")
>>>>>>> b[[pd.pkg.name]]<- dbGetQuery(conn,"SELECT fsetid from
>>>>>>> featureSet
>>>>>>> WHERE type = 6")[,1]
>>>>>>> })
>>>>>>> }
>>>>>>> dbGetQuery(conn,"SELECT * from type_dict")
>>>>>>>
>>>>>>>> a
>>>>>>> $pd.hugene.1.0.st.v1
>>>>>>> type count(*)
>>>>>>> 1 NA 227
>>>>>>> 2 1 253002
>>>>>>> 3 2 57
>>>>>>> 4 4 45
>>>>>>> 5 6 1195
>>>>>>> 6 7 2904
>>>>>>>
>>>>>>> $pd.hugene.1.1.st.v1
>>>>>>> type count(*)
>>>>>>> 1 NA 227
>>>>>>> 2 1 253002
>>>>>>> 3 2 57
>>>>>>> 4 4 45
>>>>>>> 5 6 1195
>>>>>>> 6 7 2904
>>>>>>>
>>>>>>> $pd.hugene.2.0.st
>>>>>>> type count(*)
>>>>>>> 1 NA 3728
>>>>>>> 2 1 345497
>>>>>>> 3 2 18
>>>>>>> 4 4 23
>>>>>>> 5 7 3575
>>>>>>> 6 9 18
>>>>>>>
>>>>>>> $pd.hugene.2.1.st
>>>>>>> type count(*)
>>>>>>> 1 NA 3728
>>>>>>> 2 1 345497
>>>>>>> 3 2 18
>>>>>>> 4 4 23
>>>>>>> 5 7 3575
>>>>>>> 6 9 18
>>>>>>>
>>>>>>> $pd.mogene.1.0.st.v1
>>>>>>> type count(*)
>>>>>>> 1 NA 86
>>>>>>> 2 1 234878
>>>>>>> 3 2 21
>>>>>>> 4 4 45
>>>>>>> 5 6 1324
>>>>>>> 6 7 5222
>>>>>>>
>>>>>>> $pd.mogene.1.1.st.v1
>>>>>>> type count(*)
>>>>>>> 1 NA 86
>>>>>>> 2 1 234878
>>>>>>> 3 2 21
>>>>>>> 4 4 45
>>>>>>> 5 6 1324
>>>>>>> 6 7 5222
>>>>>>>
>>>>>>> $pd.mogene.2.0.st
>>>>>>> type count(*)
>>>>>>> 1 NA 810
>>>>>>> 2 1 263551
>>>>>>> 3 2 18
>>>>>>> 4 4 23
>>>>>>> 5 7 5331
>>>>>>> 6 9 18
>>>>>>>
>>>>>>> $pd.mogene.2.1.st
>>>>>>> type count(*)
>>>>>>> 1 NA 810
>>>>>>> 2 1 263551
>>>>>>> 3 2 18
>>>>>>> 4 4 23
>>>>>>> 5 7 5331
>>>>>>> 6 9 18
>>>>>>>
>>>>>>> $pd.ragene.1.0.st.v1
>>>>>>> type count(*)
>>>>>>> 1 NA 254
>>>>>>> 2 1 211195
>>>>>>> 3 2 21
>>>>>>> 4 4 45
>>>>>>> 5 6 399
>>>>>>> 6 7 1153
>>>>>>>
>>>>>>> $pd.ragene.1.1.st.v1
>>>>>>> type count(*)
>>>>>>> 1 NA 254
>>>>>>> 2 1 211195
>>>>>>> 3 2 21
>>>>>>> 4 4 45
>>>>>>> 5 6 399
>>>>>>> 6 7 1153
>>>>>>>
>>>>>>> $pd.ragene.2.0.st
>>>>>>> type count(*)
>>>>>>> 1 NA 1071
>>>>>>> 2 1 214018
>>>>>>> 3 2 18
>>>>>>> 4 4 23
>>>>>>> 5 7 5083
>>>>>>> 6 9 18
>>>>>>>
>>>>>>> $pd.ragene.2.1.st
>>>>>>> type count(*)
>>>>>>> 1 NA 1071
>>>>>>> 2 1 214018
>>>>>>> 3 2 18
>>>>>>> 4 4 23
>>>>>>> 5 7 5083
>>>>>>> 6 9 18
>>>>>>>
>>>>>>>> sapply(b,length)
>>>>>>> pd.hugene.1.0.st.v1 pd.hugene.1.1.st.v1 pd.hugene.2.0.st
>>>>>>> pd.hugene.2.1.st
>>>>>>> 1195 1195
>>>>>>> 0 0
>>>>>>> pd.mogene.1.0.st.v1 pd.mogene.1.1.st.v1 pd.mogene.2.0.st
>>>>>>> pd.mogene.2.1.st
>>>>>>> 1324 1324
>>>>>>> 0 0
>>>>>>> pd.ragene.1.0.st.v1 pd.ragene.1.1.st.v1 pd.ragene.2.0.st
>>>>>>> pd.ragene.2.1.st
>>>>>>> 399 399
>>>>>>> 0 0
>>>>>>>
>>>>>>>> sessionInfo()
>>>>>>> R version 3.0.0 (2013-04-03)
>>>>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>>>>
>>>>>>> locale:
>>>>>>> [1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C
>>>>>>> [3] LC_TIME=en_AU.UTF-8 LC_COLLATE=en_AU.UTF-8
>>>>>>> [5] LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8
>>>>>>> [7] LC_PAPER=C LC_NAME=C
>>>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>>>>>>> [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
>>>>>>>
>>>>>>> attached base packages:
>>>>>>> [1] parallel stats graphics grDevices utils datasets
>>>>>>> methods
>>>>>>> [8] base
>>>>>>>
>>>>>>> other attached packages:
>>>>>>> [1] pd.ragene.2.1.st_2.12.1 pd.ragene.2.0.st_2.12.0
>>>>>>> [3] pd.ragene.1.1.st.v1_3.8.0 pd.ragene.1.0.st.v1_3.8.0
>>>>>>> [5] pd.mogene.2.1.st_2.12.1 pd.mogene.2.0.st_2.12.0
>>>>>>> [7] pd.mogene.1.1.st.v1_3.8.0 pd.mogene.1.0.st.v1_3.8.0
>>>>>>> [9] pd.hugene.2.1.st_3.8.0 pd.hugene.1.1.st.v1_3.8.0
>>>>>>> [11] pd.hugene.1.0.st.v1_3.8.0 pd.hugene.2.0.st_3.8.0
>>>>>>> [13] oligo_1.24.0 Biobase_2.20.0
>>>>>>> [15] oligoClasses_1.22.0 BiocGenerics_0.6.0
>>>>>>> [17] RSQLite_0.11.4 DBI_0.2-7
>>>>>>> [19] BiocInstaller_1.10.2
>>>>>>>
>>>>>>> loaded via a namespace (and not attached):
>>>>>>> [1] affxparser_1.32.1 affyio_1.28.0 Biostrings_2.28.0
>>>>>>> [4] bit_1.1-10 codetools_0.2-8 ff_2.2-11
>>>>>>> [7] foreach_1.4.1 GenomicRanges_1.12.3 IRanges_1.18.1
>>>>>>> [10] iterators_1.0.6 preprocessCore_1.22.0 splines_3.0.0
>>>>>>> [13] stats4_3.0.0 tools_3.0.0 zlibbioc_1.6.0
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> [[alternative HTML version deleted]]
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Bioconductor mailing list
>>>>>>> Bioconductor at r-project.org
>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>>> Search the archives:
>>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
More information about the Bioconductor
mailing list