[BioC] Manipulating large DNAStringSet(s)
Steve Lianoglou
mailinglist.honeypot at gmail.com
Wed May 5 17:16:20 CEST 2010
Hi all,
I have a rather large DNAStringSet you could get from, say,
subselecting many intervals on Hsapiens$chr1 and I want to go through
the paces of reverseComplement-ing some of the ranges that correspond
to reads on the anti-sense strand.
I'm finding that doing this the "naive" way is taking a lot of time
and memory, but adding some extra code to change my target to a
character vector (from a DNAStringSet) with some careful subsetting
and reverseComplement-ing is much faster.
So I'm wondering if I'm either doing something wrong, or perhaps
something in DNAStringSet can be optimized to deal with this?
Like so:
R> library(BSgenome.Hsapiens.UCSC.hg18)
R> ranges <- IRanges(seq(1, seqlengths(Hsapiens)['chr1'] - 100,
length.out=2141404), width=15)
R> strands <- sample(c('+', '-'), length(ranges), replace=TRUE)
R> is.pos <- strands == '+'
R> v <- Views(Hsapiens$chr1, ranges) ## Pay no mind to the NNN's in
the tail here
I now want to reverseComplement the strings in v where strands == '-'
R> strings <- DNAStringSet(v)
R> strings[!is.pos] <- reverseComplement(strings[!is.pos])
That above command takes a very long time to complete ... it's
actually been over a few minutes on my machine, so I'm killing it.
Now let's do it another way, using normal character vectors and stuff
R> strings <- DNAStringSet(v)
R> result <- character(length(strings))
R> result[is.pos] <- as.character(strings[is.pos])
R> result[!is.pos] <- as.character(reverseComplement(strings[!is.pos]))
## and optionally ##
R> result <- DNAStringSet(result)
Just curious if there's a better way?
R> sessionInfo()
R version 2.11.0 (2010-04-22)
x86_64-apple-darwin9.8.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] BSgenome.Hsapiens.UCSC.hg18_1.3.16 BSgenome_1.16.0
Biostrings_2.16.0 GenomicRanges_1.0.1
[5] IRanges_1.6.0
loaded via a namespace (and not attached):
[1] Biobase_2.8.0
--
Steve Lianoglou
Graduate Student: Computational Systems Biology
| Memorial Sloan-Kettering Cancer Center
| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
More information about the Bioconductor
mailing list