[BioC] Whole genome amino acid composition package?
Hervé Pagès
hpages at fhcrc.org
Mon Oct 21 20:53:45 CEST 2013
Hi Bernardo,
On 10/21/2013 03:41 AM, Bernardo Bello wrote:
> Hi there,
>
> I'm interested in a statistical tool to get bacterial amino acid
> composition and genetic code usage at genome level.
>
> I've looking in MESH terms database but I'm a bit lost.
> http://www.ncbi.nlm.nih.gov/pubmed?term=%28%22Genetic%20Code%22%5BMesh%5D%29%20AND%20%22Software%22%5BMesh%5D
Assuming you've access to the annotated DNA sequence for your bacteria,
you would need:
(a) A way to load the bacterial genome in Bioconductor as a
DNAString object.
Some suggested tools:
- readDNAStringSet() in Biostrings.
- scanFa() in Rsamtools.
- import.2bit() in rtracklayer.
- The biomaRt package to retrieve the bacterial genome from
an online database via a Mart service.
- You might find if convenient to wrap your bacterial genomes
into a BSgenome data package like we've done for E. coli
(see the BSgenome.Ecoli.NCBI.20080805 package).
(b) A way to load the gene/CDS coordinates as a GRanges or GRangesList
object. You need to make sure those coordinates are relative to
the exact same genome assembly you used in (a).
Some suggested tools:
- import.gff() and import.bed() in rtracklayer.
- makeTranscriptDbFromGFF() in GenomicFeatures.
- The biomaRt package to retrieve the gene coordinates from
an online database via a Mart service. Actually, if you're
going to use biomaRt, you can skip that step and retrieve
directly the gene/CDS sequences. See (c).
(c) A way to extract the gene/CDS sequences as a DNAStringSet object.
One gotcha for this step is to make sure that the sequences are
extracted from the right strand i.e. if a gene is on the minus
strand and its sequence is extracted from the plus strand, then
the sequence needs to be reverse complemented (with
reverseComplement() from the Biostrings package).
Some suggested tools:
- extractAt() from the Biostrings package. Does NOT handle
the strand for you (i.e. you need to reverse complement
sequences for genes on the minus strand).
- The biomaRt package to retrieve the gene/CDS sequences from
an online database via a Mart service.
- getSeq() from the BSgenome package (if the DNA sequences are
stored in a BSgenome data package). It does handle the
strand for you.
- Just mentioning it FYI: for more complex organisms
with introns and UTRs, extractTranscriptsFromGenome() in
GenomicFeatures would be the tool to use to extract the
transcript or CDS sequences from a BSgenome object and a
TranscriptDb object. It handles the strand and removes the
introns for you.
(d) Once you have the gene/CDS sequences in a DNAStringSet object,
you can use translate() (from Biostrings) to translate to
proteins. Right now it uses the Standard Genetic Code to
translate but it would be easy to add support for alternate
genetic codes. The result of the translation is an AAStringSet
object. You can also use extractAt() on the DNAStringSet object
to extract the codon composition of the gene/CDS sequences as
a DNAStringSetList object. A simple table(unlist()) on the
DNAStringSetList object will give you condon usage.
I don't know if there is already a package in Bioconductor or on CRAN
that does the above for you but all the building blocks are here, and
it should not be too hard to put them together. If you decide to go that
route and need more help, please come back, hopefully this time with
more specific questions.
Cheers,
H.
>
> Thanks for your help. All the best, Bernardo
>
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list