[BioC] how to operate on a DNAStringSet object

Chris Seidel seidel at phaget4.org
Thu Mar 21 21:48:01 CET 2013

I can't  figure out how to operate on a DNAStringSet object, and get a 
DNAStringSet object back. I have a DNAStringSet object for which I'd 
like to randomize the order of nucleotides in the sequences. Here is a 
simple example:

# function to randomly re-order a string
randomizeSeq <- function(mystring){
   # split the string into characters
   myChars <- unlist(strsplit(as.character(mystring),''))
   # randomize the order of characters
   random.myChars <- sample(myChars,length(myChars))
   # concatenate them back into a string and return the result

# create some sequences
seq1 <- paste(DNA_BASES[sample(1:4,5,replace=T)], collapse="")
seq2 <- paste(DNA_BASES[sample(1:4,5,replace=T)], collapse="")

# make a DNAStringSet object
myseqs <- DNAStringSet(c(seq1,seq2))
names(myseqs) <- c("s1", "s2")

# apply the randomize function to the DNAStringSet object 
myRandomizedseqs <- sapply(myseqs,randomizeSeq)

# The result is a list, try to make a DNAStringSetObject out of it

# that doesn't work, try something else

# that doesn't work. 

I gave up (after several hours) and wrote a for loop, which works but 
takes forever on my sequence set, and makes me feel stupid.

What's odd, is that this actually works:


*IF* the sequences are NOT NAMED.

How does one operate on the sequences of a DNAStringSet object without 
getting a list back, or without a for loop? I'm sure there's some 
elegant one-liner that completely escapes me.

> sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: x86_64-w64-mingw32/x64 (64-bit)

[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United 
States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] BSgenome.Dmelanogaster.UCSC.dm3_1.3.19 BSgenome_1.26.1                        
Biostrings_2.26.2                      GenomicRanges_1.10.5                  
[5] IRanges_1.16.4                         BiocGenerics_0.4.0                    

loaded via a namespace (and not attached):
[1] parallel_2.15.2 stats4_2.15.2   tools_2.15.2

More information about the Bioconductor mailing list