[BioC] Whole genome amino acid composition package?

Hervé Pagès hpages at fhcrc.org
Mon Oct 21 22:28:57 CEST 2013


On 10/21/2013 12:57 PM, Bernardo Bello wrote:
> Dear Hervé,
>
> Thank you very much for your help, it seems you handle R with ease. Wow!
> Hope one day I would manage R this way.
>
> I forgot to tell that I need to input R tens if bacterial genomes at the
> same tine and get statistical codon usage for each batch. Is that
> possible to automate this in your described workflow?

You give so little details (in particular you don't say where your data
is coming from and/or what it looks like) that it's hard to tell for
sure, but I would say: why not?

More precisely: assuming you've managed to solve the problem for 1
bacterial genome (i.e. you now have a function that summarizes codon
usage for a given bacterial genome), then I don't see any reason why
you couldn't apply this function to a batch of genomes and combine
the results together. Depending on the format you use for the
individual codon usage summaries, combining the results might be
trivial (e.g. could just consist of rbind()'ing the rows of a 64-col
table) or a little bit more complicated... Again, hard to know without
you providing more details.

H.

>
> Best, Bernardo
>
>
> 2013/10/21 Hervé Pagès <hpages at fhcrc.org <mailto:hpages at fhcrc.org>>
>
>     Hi Bernardo,
>
>
>     On 10/21/2013 03:41 AM, Bernardo Bello wrote:
>
>         Hi there,
>
>         I'm interested in a statistical tool to get bacterial amino acid
>         composition and genetic code usage at genome level.
>
>         I've looking in MESH terms database but I'm a bit lost.
>         http://www.ncbi.nlm.nih.gov/__pubmed?term=%28%22Genetic%__20Code%22%5BMesh%5D%29%20AND%__20%22Software%22%5BMesh%5D
>         <http://www.ncbi.nlm.nih.gov/pubmed?term=%28%22Genetic%20Code%22%5BMesh%5D%29%20AND%20%22Software%22%5BMesh%5D>
>
>
>     Assuming you've access to the annotated DNA sequence for your bacteria,
>     you would need:
>
>        (a) A way to load the bacterial genome in Bioconductor as a
>            DNAString object.
>
>            Some suggested tools:
>              - readDNAStringSet() in Biostrings.
>              - scanFa() in Rsamtools.
>              - import.2bit() in rtracklayer.
>              - The biomaRt package to retrieve the bacterial genome from
>                an online database via a Mart service.
>              - You might find if convenient to wrap your bacterial genomes
>                into a BSgenome data package like we've done for E. coli
>                (see the BSgenome.Ecoli.NCBI.20080805 package).
>
>        (b) A way to load the gene/CDS coordinates as a GRanges or
>     GRangesList
>            object. You need to make sure those coordinates are relative to
>            the exact same genome assembly you used in (a).
>
>            Some suggested tools:
>              - import.gff() and import.bed() in rtracklayer.
>              - makeTranscriptDbFromGFF() in GenomicFeatures.
>              - The biomaRt package to retrieve the gene coordinates from
>                an online database via a Mart service. Actually, if you're
>                going to use biomaRt, you can skip that step and retrieve
>                directly the gene/CDS sequences. See (c).
>
>        (c) A way to extract the gene/CDS sequences as a DNAStringSet object.
>            One gotcha for this step is to make sure that the sequences are
>            extracted from the right strand i.e. if a gene is on the minus
>            strand and its sequence is extracted from the plus strand, then
>            the sequence needs to be reverse complemented (with
>            reverseComplement() from the Biostrings package).
>
>            Some suggested tools:
>              - extractAt() from the Biostrings package. Does NOT handle
>                the strand for you (i.e. you need to reverse complement
>                sequences for genes on the minus strand).
>              - The biomaRt package to retrieve the gene/CDS sequences from
>                an online database via a Mart service.
>              - getSeq() from the BSgenome package (if the DNA sequences are
>                stored in a BSgenome data package). It does handle the
>                strand for you.
>              - Just mentioning it FYI: for more complex organisms
>                with introns and UTRs, extractTranscriptsFromGenome() in
>                GenomicFeatures would be the tool to use to extract the
>                transcript or CDS sequences from a BSgenome object and a
>                TranscriptDb object. It handles the strand and removes the
>                introns for you.
>
>         (d) Once you have the gene/CDS sequences in a DNAStringSet object,
>             you can use translate() (from Biostrings) to translate to
>             proteins. Right now it uses the Standard Genetic Code to
>             translate but it would be easy to add support for alternate
>             genetic codes. The result of the translation is an AAStringSet
>             object. You can also use extractAt() on the DNAStringSet object
>             to extract the codon composition of the gene/CDS sequences as
>             a DNAStringSetList object. A simple table(unlist()) on the
>             DNAStringSetList object will give you condon usage.
>
>     I don't know if there is already a package in Bioconductor or on CRAN
>     that does the above for you but all the building blocks are here, and
>     it should not be too hard to put them together. If you decide to go that
>     route and need more help, please come back, hopefully this time with
>     more specific questions.
>
>     Cheers,
>     H.
>
>
>         Thanks for your help. All the best, Bernardo
>
>
>
>         _________________________________________________
>         Bioconductor mailing list
>         Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>         https://stat.ethz.ch/mailman/__listinfo/bioconductor
>         <https://stat.ethz.ch/mailman/listinfo/bioconductor>
>         Search the archives:
>         http://news.gmane.org/gmane.__science.biology.informatics.__conductor
>         <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>
>
>     --
>     Hervé Pagès
>
>     Program in Computational Biology
>     Division of Public Health Sciences
>     Fred Hutchinson Cancer Research Center
>     1100 Fairview Ave. N, M1-B514
>     P.O. Box 19024
>     Seattle, WA 98109-1024
>
>     E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>     Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>     Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>
>
>
>
> --
>
> *Bernardo Bello Ortí*
>
> PhD student
>
> CReSA-IRTA
>
> Campus de Bellaterra-Universitat Autònoma de Barcelona
>
> Edifici CReSA
>
> 08193  Bellaterra (Barcelona, Spain)
>
> Tel.: 647 42 52 63 *www.cresa.es <http://www.cresa.es> *
>
> *
> *
>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list