[BioC] rho statistics for dinucleotide abundance from a sequence file
Steve Lianoglou
mailinglist.honeypot at gmail.com
Tue Jan 17 15:47:21 CET 2012
Hi,
Not sure about seqinr, but I took a crack at implementing the rho
statistic using the (super fast) functionality in Biostrings here:
https://gist.github.com/1626857
If you copy and paste that code into your R session, you should be
able o load up a FASTA file and calculate the rho statistic on each
sequence like so:
R> ## paste code from gist above
R> seqs <- read.DNAStringSet('/path/to/your/sequences.fa', 'fasta')
R> rho.stats <- rho(seqs)
`rho.stats` will be a matrix. The columns have the rho stat for the
dinucleotide identified in the column names of the matrx, the rows
have the rho stat for the corresponding (i'th) sequence in `seqs`
I'm calculating "rho" as described in seqnir's help pages. I *think*
I'm doing it correctly, but if you have some test sequences you know
the answer for, you should double check.
Perhaps this (or a modification of it) can be added to Biostrings
itself. I can supply a patch[1]
-steve
[1] Along w/ my other outsanding biostrings patch, sorry Herve! :-)
On Tue, Jan 17, 2012 at 2:44 AM, Utpal Bakshi [guest]
<guest at bioconductor.org> wrote:
>
> Hi all,
>
> I have a sequence file (fasta format) and want to calculate the rho statistics for dinucleotide abundance value on my data.. the code which I use is (using seqinr library and current working directory)
>
> seq_info<-read.fasta("gene.txt")
> rho(seq_info[1],2)
>
> but it yields only the dinucleotides, not their rho values, i.e,
>
>> rho(seq_info[1],2)
> aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt
>
> I will be grateful if anyone solve this.. I've also attached the sequence below..
> Thanks in advance..
>
>
>
>>gi|270279749|gene0003
> ATGTATATGAGAAAGGAAGAGCCTAGCGGCTCAGACAAGATTATGACTTCAGTTGTTGTTGTAGGTACCCAATGGGGCGATGAAGGTAAAGGGAAAATTACAGATTTTCTTTCAGCTAATGCAGAGGTGATTGCTCGTTACCAAGGTGGTGATAATGCTGGTCACACAATTGTGATTGATGGCAAGAAATTTAAGTTGCACTTGATTCCATCTGGAATTTTCTTCCCTGAAAAAATTTCAGTTATTGGAAACGGTATGGTTGTAAACCCTAAATCACTTGTGAAAGAATTGTCTTATCTGCATGAAGAAGGTGTTACAACAGATAATCTACGTATCTCTGATCGTGCGCATGTTATTTTGCCTTACCACATTGAGTTGGATCGCTTGCAAGAAGAAGCTAAGGGTGATAATAAGATTGGTACTACAATAAAGGGAATTGGTCCAGCATATATGGACAAAGCTGCTCGTGTCGGGATTCGTATTGCAGATCTTTTGGATAAGGATATTTTCCGTGAACGCTTGGAACGCAATCTTGCGGAGAAGAATCGTCTGTTTGAAAAATTGTATGACAGTACTCCTATTTCAATTGATGATATTTTTGAAGAGTACTATGAGTATGGCCAACAAATTAAGCAGTATGTGACAGATACATCTGTTATTTTGAACGATGCGCTTGATAACGGCAAACGTGTGCTTTTTGAAGGTGCGCAAGGTGTCATGTTGGATATTGACCAAGGTACTTATCCATTTGTTACTTCTTCAAACCCTGTTGCTGGTGGTGTGACAATTGGGTCTGGTGTTGGTCCAAGTAAGATTGACAAGGTTGTAGGTGTTTGTAAAGCCTATACAAGTCGTGTAGGTGATGGACCTTTCCCAACTGAATTATTTGATGAAGTGGGAGATCGCATTCGTGAAGTAGGTCATGAGTATGGTACAACAACTGGCCGTCCACGTCGTGTGGGTTGGTTTGACTCAGTTGTGATGCGTCAC
> AGCCGTCGTGTATCTGGTATTACCAATCTTTCATTGAACTCTATCGATGTTTTGAGCGGTTTGGATACTGTGAAAATCTGTGTGGCCTATGATCTCGATGGTCAACGTATCGACCACTACCCAGCTAGTCTTGAACAGTTGAAACGTTGCAAACCTATCTACGAAGAATTGCCAGGGTGGTCAGAAGACATCACAGGAGTTCGTAATTTGGAAGATCTTCCTGAGAATGCGCGTAACTATGTTCGTCGTGTGAGTGAATTGGTTGGCGTTCGTATTTCGACATTCTCAGTAGGTCCTGGTCGTGAACAAACCAATATTTTAGAAAGTGTTTGGTCTTAA
>
> -- output of sessionInfo():
>
> R version 2.14.0 (2011-10-31)
> Platform: i386-pc-mingw32/i386 (32-bit)
>
> locale:
> [1] LC_COLLATE=English_India.1252 LC_CTYPE=English_India.1252
> [3] LC_MONETARY=English_India.1252 LC_NUMERIC=C
> [5] LC_TIME=English_India.1252
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> --
> Sent via the guest posting facility at bioconductor.org.
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
--
Steve Lianoglou
Graduate Student: Computational Systems Biology
| Memorial Sloan-Kettering Cancer Center
| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
More information about the Bioconductor
mailing list