[Bioc-sig-seq] ChIPpeakAnno, BioMart, getAnnotation 'Exon' error message

Wed Mar 17 15:48:25 CET 2010

Julie

Why do you say that "the database contains errors"? I had a look at
http://gbrowse.arabidopsis.org/cgi-bin/gbrowse/arabidopsis/?name=AT1G68552.1
and while this is perhaps a complex locus whose expression we have not 
yet fully understood, or not yet properly formalised into the database's 
ontology of genomic features and gene products, I am not sure "error" is 
the right term for that.

Arabidopsis people might have more insight into that.
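
In the meantime, here is a minimal sketch of how one might inspect the repeated exon IDs and still obtain unique row names without discarding either description. It uses base R only and a toy data frame, `exons`, standing in for the biomaRt result: the two AT1G68552.1-E12203 rows mirror the example quoted in this thread (descriptions shortened), the third row is made up, and all object names are illustrative assumptions rather than anything ChIPpeakAnno does internally.

```r
# Toy stand-in for the biomaRt result. The first two rows mirror the
# AT1G68552.1-E12203 example from this thread (descriptions shortened);
# the third row is a made-up placeholder.
exons <- data.frame(
  ensembl_exon_id  = c("AT1G68552.1-E12203", "AT1G68552.1-E12203",
                       "AT1G01010.1-E1"),
  chromosome_name  = c("1", "1", "1"),
  exon_chrom_start = c(25727627L, 25727627L, 3631L),
  exon_chrom_end   = c(25727701L, 25727701L, 3913L),
  strand           = c(-1L, -1L, 1L),
  description      = c("CPuORF53 (Conserved peptide upstream open reading frame 53)",
                       "AP2 domain-containing transcription factor, putative",
                       "hypothetical placeholder description"),
  stringsAsFactors = FALSE
)

# All rows whose exon ID occurs more than once (duplicated() alone
# would return only the second and later copies).
dup_ids  <- unique(exons$ensembl_exon_id[duplicated(exons$ensembl_exon_id)])
dup_rows <- exons[exons$ensembl_exon_id %in% dup_ids, ]

# Do the copies agree on coordinates and differ only in description?
coord_cols  <- c("ensembl_exon_id", "chromosome_name", "exon_chrom_start",
                 "exon_chrom_end", "strand")
same_coords <- !any(duplicated(unique(dup_rows[, coord_cols])$ensembl_exon_id))

# One row per exon, keeping every description instead of picking one.
collapsed <- aggregate(description ~ ensembl_exon_id + chromosome_name +
                         exon_chrom_start + exon_chrom_end + strand,
                       data = exons,
                       FUN = function(x) paste(unique(x), collapse = " | "))
rownames(collapsed) <- collapsed$ensembl_exon_id
```

If `same_coords` were FALSE, the same ID would carry different coordinates, the last assignment would fail with the same "duplicate rownames" error, and those rows would indeed need a closer look by hand.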

	Wolfgang

Zhu, Julie scripsit 16/03/10 22:56:
> Hi,
> 
> I obtained the exon sequences and here are the duplicate exon IDs with different descriptions.
> 
> TSS[duplicated(TSS[,1]), 1]
>  [1] "AT1G68552.1-E12203"  "AT1G64140.1-E14755"  "AT1G64140.1-E14756"  "AT1G70780.1-E4116"
>  [5] "AT1G75390.1-E22428"  "AT1G06149.1-E1988"   "AT1G36730.1-E35050"  "AT1G36730.1-E35051"
>  [9] "AT1G29952.1-E5728"   "AT1G29952.1-E5730"   "AT1G29952.1-E5732"   "AT1G29970.2-E8863"
> [13] "AT1G29970.2-E8864"   "AT1G64628.1-E10574"  "AT1G25470.1-E20679"  "AT1G58120.1-E18468"
> [17] "AT1G29041.1-E15117"  "AT1G23149.1-E13728"  "AT1G29952.1-E5728"   "AT1G29952.1-E5732"
> [21] "AT2G18162.1-E49029"  "AT3G51632.1-E98183"  "AT3G22970.1-E89708"  "AT3G45240.2-E86808"
> [25] "AT3G18000.1-E98438"  "AT3G59052.1-E77046"  "AT3G62422.1-E76351"  "AT3G25570.1-E88575"
> [29] "AT3G25570.1-E88576"  "AT3G10910.1-E77164"  "AT3G02468.1-E88931"  "AT3G12010.1-E78704"
> [33] "AT3G01470.1-E92685"  "AT3G53402.1-E93478"  "AT3G26430.1-E85151"  "AT3G26430.1-E85154"
> [37] "AT4G19110.1-E121565" "AT4G22592.1-E113550" "AT4G22592.1-E113551" "AT4G22592.1-E113552"
> [41] "AT4G12430.1-E113931" "AT4G12430.1-E113932" "AT4G12430.1-E113933" "AT4G25670.1-E111076"
> [45] "AT4G25670.1-E111077" "AT4G36990.1-E122859" "AT4G14620.1-E120308" "AT4G34590.1-E116802"
> [49] "AT5G09460.1-E136355" "AT5G09460.1-E136357" "AT5G50010.1-E151574" "AT5G50010.1-E151576"
> [53] "AT5G50010.1-E151574" "AT5G50011.1-E153108" "AT5G50011.1-E153110" "AT5G09460.1-E136355"
> [57] "AT5G09463.1-E151757" "AT5G09463.1-E151758" "AT5G52552.1-E136887" "AT5G52552.1-E136888"
> [61] "AT5G41992.1-E154552" "AT5G64341.1-E144370" "AT5G64341.1-E144371" "AT5G64341.1-E144373"
> [65] "AT5G64341.1-E144370" "AT5G64341.1-E144371" "AT5G64343.1-E148873" "AT5G64341.1-E144373"
> [69] "AT5G09460.1-E136355" "AT5G09463.1-E151757" "AT5G09460.1-E136357" "AT5G09463.1-E151758"
> [73] "AT5G49448.1-E171824" "AT5G05282.1-E152619" "AT5G53588.1-E159453" "AT5G09670.2-E157563"
> [77] "AT5G01710.1-E140929" "AT5G64341.1-E144370" "AT5G64343.1-E148873" "AT5G61230.1-E153842"
> [81] "AT5G61230.1-E153843" "AT5G60550.1-E140873" "AT5G64552.1-E148753" "AT5G64552.1-E148754"
> [85] "AT5G45430.1-E151338"
> 
> For example,
> 
> TSS[TSS[,1]=="AT1G68552.1-E12203",]
>          ensembl_exon_id chromosome_name exon_chrom_start exon_chrom_end strand
> 3125  AT1G68552.1-E12203               1         25727627       25727701     -1
> 15537 AT1G68552.1-E12203               1         25727627       25727701     -1
>       description
> 3125  CPuORF53 (Conserved peptide upstream open reading frame 53); Upstream open reading frames (uORFs) are small open reading frames found in the 5' UTR of a mature mRNA, and can potentially mediate translational regulation of the largest, or major, ORF (mORF). CPuORF53 represents a conserved upstream opening reading frame relative to major ORF AT1G68550.1
> 15537 AP2 domain-containing transcription factor, putative; encodes a member of the ERF (ethylene response factor) subfamily B-6 of ERF/AP2 transcription factor family. The protein contains one AP2 domain. There are 12 members in this subfamily including RAP2.11.
> 
> So I think the database contains errors. In this case, it will require manual curation to determine which row to choose. Did you contact ensembl about this? Thanks!
> 
> Best regards,
> 
> Julie
> 
> 
> *******************************************
> Lihua Julie Zhu, Ph.D
> Research Associate Professor
> Program Gene Function and Expression
> University of Massachusetts Medical School
> 364 Plantation Street, Room 613
> Worcester, MA 01605
> 508-856-5256
> http://www.umassmed.edu/pgfe/faculty/zhu.cfm
> *******************************************
> 
> On 3/5/10 6:46 PM, "pterry at huskers.unl.edu" <pterry at huskers.unl.edu> wrote:
> 
> 
> 
>  Dear bioc-sig-sequencing,
> 
> I would like to annotate chip-seq peaks for the arabidopsis genome.  "TSS" and "Exon" are two of the arguments for the 'getAnnotation' function.  The "TSS" argument succeeded, but the "Exon" argument failed.
> 
> ...
>> arabdset<-useMart(biomart="plant_mart_4", dataset = "athaliana_eg_gene")
> Checking attributes ... ok
> Checking filters ... ok
>> ExonArabAnno<-getAnnotation(arabdset, featureType="Exon")
> Error in `rownames<-`(`*tmp*`, value = c("ATCG00010.1-E176369", "ATMG00010.1-E176520",  :
>   duplicate rownames not allowed
> 
>> sessionInfo()
> R version 2.11.0 Under development (unstable) (2010-02-28 r51186)
> x86_64-unknown-linux-gnu
> 
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
> 
> other attached packages:
>  [1] ChIPpeakAnno_1.3.4                  org.Hs.eg.db_2.3.6
>  [3] GO.db_2.3.5                         RSQLite_0.8-3
>  [5] DBI_0.2-5                           AnnotationDbi_1.9.4
>  [7] BSgenome.Ecoli.NCBI.20080805_1.3.16 BSgenome_1.15.11
>  [9] Biostrings_2.15.22                  IRanges_1.5.51
> [11] multtest_2.3.0                      Biobase_2.7.4
> [13] biomaRt_2.3.4
> 
> loaded via a namespace (and not attached):
> [1] MASS_7.3-5      RCurl_1.3-1     splines_2.11.0  survival_2.35-8
> [5] tools_2.11.0    XML_2.6-0
> 
> Can someone comment?
> 
> 
> Thanks,
> P. Terry
> pterry at huskers.unl.edu
> 
> 
> _______________________________________________
> Bioc-sig-sequencing mailing list
> Bioc-sig-sequencing at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing

-- 

Best wishes
      Wolfgang

--
Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber/contact