[BioC] advice on Biostrings
Wolfgang Huber
huber at ebi.ac.uk
Wed Feb 22 11:06:11 CET 2006
Rafael A Irizarry wrote:
> Hi, I'm using Biostrings to count base content as well as the content of
> pairs of bases. I'm using the following snippet of code:
>
Hi Rafa,
to count symbols in character vectors, matchprobes::basecontent is fast:
library(matchprobes)
v = c("AAACT", "GGGTT", "ggAtT")
bc = basecontent(v)
print.default(bc)
bc[,"C"]+bc[,"G"]
and if there is interest I'd be happy to amend the C code to also count
pairs of bases (or you could, it is not terribly complicated).
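
For the pair counts, a minimal base-R sketch (not the matchprobes C code; the helper name count_fixed is made up for illustration) could look like the following. It assumes plain character vectors and converts them to upper case first, so that "ggAtT" is handled. "GC" and "CG" are counted separately and summed, so overlapping hits such as the two in "GCG" are both counted.

```r
# Hypothetical helper: count occurrences of a fixed substring in each
# element of a character vector. gregexpr() returns -1 when there is no
# match, so only positive start positions are counted.
count_fixed <- function(pattern, x) {
  m <- gregexpr(pattern, x, fixed = TRUE)
  vapply(m, function(pos) sum(pos > 0L), integer(1))
}

v <- c("AAACT", "GGGTT", "ggAtT", "GCGC")
x <- toupper(v)

# Single-base counts: one row per sequence, one column per base.
bases <- sapply(c("A", "C", "G", "T"), count_fixed, x = x)

# Number of GC plus CG dinucleotides per sequence.
gc_cg <- count_fixed("GC", x) + count_fixed("CG", x)

print(bases)
print(gc_cg)
```

This avoids building one DNAString per sequence, but I have not timed it against the Biostrings version.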
Cheers
Wolfgang
>
> ###pmseq is a vector of character strings (not of the same nchar).
> tmp <- sapply(pmseq,function(x){
> y = DNAString(x)
> c(alphabetFrequency(y)[2:5], ##count A,T,G,C
> length(matchDNAPattern("GC",y))+length(matchDNAPattern("CG",y)))
> ##count GC or CG
> })
>
> It is painfully slow. strsplit and grep were much faster for the first
> part (counting bases), but using grep for the second part was not
> straightforward.
>
> Any suggestions?
-------------------------------------
Wolfgang Huber
European Bioinformatics Institute
European Molecular Biology Laboratory
Cambridge CB10 1SD
England
Phone: +44 1223 494642
Fax: +44 1223 494486
Http: www.ebi.ac.uk/huber