[BioC] Arabidopsis chromosome location mappings

Marc Carlson mcarlson at fhcrc.org
Fri Oct 10 23:23:02 CEST 2008


Hi Herve,

That is a good point, but if we wanted to use that in our annotation
packages, we would have to "connect" the appropriate BSgenome package
with the matching annotation package (to make sure that both were based
on the exact same build of a particular genome).  Right now, users would
have to be careful to check that the build in the BSgenome package below
matches the build used for the annotation data (also from TAIR) in the
Arabidopsis annotation packages before they use this information.
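
As a rough illustration of the kind of check a user would have to do by
hand, here is a minimal base-R sketch.  The helper same_chr_lengths() and
the annot_lengths vector are hypothetical (nothing like them ships in the
packages today), and the annotation-side names without the "chr" prefix
are an assumption.  Note that a mismatch would show that two sources come
from different builds, but agreement proves nothing on its own: the two
snapshots Herve mentions have identical lengths.

```r
# Hypothetical helper (not part of any Bioconductor package): TRUE when two
# named vectors of chromosome lengths agree after normalising the names.
same_chr_lengths <- function(genome_lengths, annot_lengths) {
  # Drop an optional "chr" prefix so that "chr1" and "1" compare as equal.
  normalise <- function(x) {
    names(x) <- sub("^chr", "", names(x), ignore.case = TRUE)
    x[order(names(x))]
  }
  g <- normalise(genome_lengths)
  a <- normalise(annot_lengths)
  identical(names(g), names(a)) && all(g == a)
}

# Lengths as printed by seqlengths(Athaliana) in Herve's message below.
genome_lengths <- c(chr1 = 30432563, chr2 = 19705359, chr3 = 23470805,
                    chr4 = 18585042, chr5 = 26992728, chrC = 154478,
                    chrM = 366924)

# Assumed annotation-side lengths; the unprefixed names are made up for
# this illustration and the values are simply copied from above.
annot_lengths <- c("1" = 30432563, "2" = 19705359, "3" = 23470805,
                   "4" = 18585042, "5" = 26992728, C = 154478, M = 366924)

same_chr_lengths(genome_lengths, annot_lengths)
```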

For now, I have written to the folks at TAIR to ask whether they have a
file on their FTP site that I can use alongside the rest of their genome
annotations that are currently in the Arabidopsis packages.  So far they
have not responded, but it might just take them a couple of weeks to get
back to me.  It was nice of Cara to point out the TAIR HTML page, but
this is hardly ideal, as I would have to be sure that the information on
the TAIR web page is perfectly synced with the latest information on the
TAIR FTP site, which is where our annotation pipeline gets all of its
regularly updated TAIR data.  The chromosome lengths are unlikely to
change very much from one build to the next, but I think we should still
insist on getting matching start/stop locations and length information
from a single build into each package.
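
In the meantime, anyone who needs the lengths can build the named vector
by hand, as suggested earlier in the thread.  The sketch below does that
with the values from Herve's seqlengths() output and then checks that a
set of gene coordinates falls inside those lengths.  The genes data frame
is invented purely for illustration; real start and end positions would
have to come from the CHRLOC and CHRLOCEND mappings, and I am assuming
here that they have already been converted to positive coordinates.

```r
# Chromosome lengths as reported by seqlengths(Athaliana) in Herve's message.
chr_lengths <- c(chr1 = 30432563, chr2 = 19705359, chr3 = 23470805,
                 chr4 = 18585042, chr5 = 26992728, chrC = 154478,
                 chrM = 366924)

# Hypothetical gene coordinates, made up for this example only.
genes <- data.frame(
  id    = c("geneA", "geneB", "geneC"),
  chr   = c("chr1", "chr5", "chrC"),
  start = c(3631, 26990000, 160000),
  end   = c(5899, 26992000, 161000),
  stringsAsFactors = FALSE
)

# A gene is consistent with the lengths when
# 1 <= start <= end <= length of its chromosome.
in_range <- genes$start >= 1 &
  genes$start <= genes$end &
  genes$end <= chr_lengths[genes$chr]

# IDs of genes whose coordinates do not fit the chromosome lengths.
genes$id[!in_range]
```

Any ID listed by the last line would suggest that the coordinates and the
lengths do not come from the same build (or that the data are wrong).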


  Marc




Herve Pages wrote:
> Hi Sam, Marc, Cara,
>
> Another way to get the chromosome lengths is to load the Arabidopsis
> genome and to use the seqlengths() function on it:
>
> > library(BSgenome.Athaliana.TAIR.04232008)
> > Athaliana
> > seqlengths(Athaliana)
>     chr1     chr2     chr3     chr4     chr5     chrC     chrM
> 30432563 19705359 23470805 18585042 26992728   154478   366924
>
> seqlengths() is new in BioC 2.3 (our next release, scheduled in less than
> 2 weeks) so make sure you use the current devel version of
> Bioconductor for now.
>
> Also, BSgenome.Athaliana.TAIR.04232008 is new in BioC 2.3, so 2 versions
> of this genome are now available: the snapshot from January 22, 2004 and
> the snapshot from April 23, 2008. Note that the names of the chromosomes
> have changed between the 2 versions but their lengths remain the same.
>
> See ?Athaliana for the details on which files were used to make this
> BSgenome data package.
>
> Use available.genomes() from the BSgenome software package to get the
> list of all BSgenome data packages that are currently available on the
> Bioconductor repositories for your version of R/Bioconductor.
>
> Cheers,
> H.
>
>
> Cara Winter wrote:
>> Marc,
>>
>> TAIR (www.arabidopsis.org) is the official source for all Arabidopsis
>> sequence and annotation information.  Here is a link that contains
>> the chromosome lengths and other genome assembly information:
>>
>> http://www.arabidopsis.org/portals/genAnnotation/gene_structural_annotation/agicomplete.jsp
>>
>>
>> Any questions regarding Arabidopsis sequence data can be sent to
>> curator at arabidopsis.org.  Thank you very much for your interest in
>> including Arabidopsis data in the Bioconductor packages.
>>
>> Best, Cara
>>
>> -- 
>> Cara Winter
>> Cell and Molecular Biology Graduate Group
>> University of Pennsylvania School of Medicine
>> Philadelphia, PA  19104
>> Phone: 215-266-1703
>> email: caramw at mail.med.upenn.edu
>>
>> ----- Original Message -----
>> From: "Marc Carlson" <mcarlson at fhcrc.org>
>> To: "Samuel Wuest" <wuests at tcd.ie>
>> Cc: bioconductor at stat.math.ethz.ch
>> Sent: Monday, October 6, 2008 12:11:52 PM GMT -05:00 US/Canada Eastern
>> Subject: Re: [BioC] Arabidopsis chromosome location mappings
>>
>> Hi Samuel,
>>
>> The CHRLENGTHS mapping would just be a named vector of the chromosome
>> lengths for Arabidopsis.  If we had one for Arabidopsis, it would not
>> contain chromosome location mappings for much of anything.  We
>> normally get the CHRLENGTHS information from UCSC, but unfortunately
>> they don't cover Arabidopsis, so we don't have a source for this
>> information.  But since it is nothing more than a named vector of the
>> chromosome lengths, if you know the lengths you could fill it in
>> pretty easily by creating the named vector yourself.  Also, if you can
>> recommend a reliable public source for this information that the
>> Arabidopsis community considers trustworthy, please tell me about it
>> so that we know about it too.
>>
>> If you really want the location of the start of these genes along the
>> chromosomes, that information (from TAIR) is present in the
>> ath1121501CHRLOC mapping.  And if you want the ends, then you can find
>> those in the ath1121501CHRLOCEND mapping (but this last mapping is only
>> found in the most recent devel packages).
>> Please let me know if I answered your questions,
>>
>>
>>   Marc
>>
>>
>>
>>
>> Samuel Wuest wrote:
>>> Hi,
>>>
>>> Hope you're fine…
>>> I am trying to make whole-genome plots using the geneplotter and
>>> annotate packages. The organism I am studying is Arabidopsis
>>> thaliana, and the annotations are evidently not as extensive there:
>>> when I try to build a chromLocation object, it fails (see the error
>>> below). It seems that the chromosome location mappings are not
>>> provided in the Arabidopsis annotation package (see below).
>>>
>>> My question: is there any way of plotting Arabidopsis gene expression
>>> data along a chromosome? Should I just order the gene IDs (luckily, for
>>> the TAIR IDs one can infer the gene order along a chromosome)? Has
>>> anyone made a script for this?
>>>
>>> Thanks for any help, best wishes,
>>>
>>> Sam
>>>
>>>
>>>
>>> > library(ath1121501.db)
>>> > newChrClass <- buildChromLocation("ath1121501")
>>> Error in get(mapName, envir = pkgEnv, inherits = FALSE) :
>>>   variable "ath1121501CHRLENGTHS" was not found
>>>
>>> > objects("package:ath1121501.db")
>>>  [1] "ath1121501"             "ath1121501ACCNUM"
>>> "ath1121501ARACYC"       "ath1121501ARACYCENZYME" "ath1121501CHR"
>>>  [6] "ath1121501CHRLOC"       "ath1121501ENZYME"
>>> "ath1121501ENZYME2PROBE" "ath1121501GENENAME"     "ath1121501GO"
>>> [11] "ath1121501GO2ALLPROBES" "ath1121501GO2PROBE"
>>> "ath1121501MAPCOUNTS"    "ath1121501MULTIHIT"     "ath1121501ORGANISM"
>>> [16] "ath1121501PATH"         "ath1121501PATH2PROBE"
>>> "ath1121501PMID"         "ath1121501PMID2PROBE"   "ath1121501SYMBOL"
>>> [21] "ath1121501_dbInfo"      "ath1121501_dbconn"
>>> "ath1121501_dbfile"      "ath1121501_dbschema"
>>>
>>>
>>> > sessionInfo()
>>> R version 2.7.0 (2008-04-22)
>>> i386-apple-darwin8.10.1
>>>
>>> locale:
>>> en_IE.UTF-8/en_IE.UTF-8/C/C/en_IE.UTF-8/en_IE.UTF-8
>>>
>>> attached base packages:
>>> [1] tools     stats     graphics  grDevices utils     datasets  methods   base
>>>
>>> other attached packages:
>>>  [1] ath1121501.db_2.2.0  TinesATH1.db_1.0     geneplotter_1.18.0
>>> annotate_1.18.0      xtable_1.5-2         AnnotationDbi_1.2.0
>>>
>>> ------------------------------------------------------------------------
>>>
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
>
