[Bioc-sig-seq] Fast inject into XStringSet objects

Wed May 20 03:24:54 CEST 2009

Herve,
The trimLRPatterns already can do (1) below so you can tick that off the 
TODO list (This is because there is a narrow method for 
QualityScaledXStringSet):

 > suppressMessages(library(Biostrings))
 > data(srPhiX174)
 > x <- QualityScaledDNAStringSet(srPhiX174, SolexaQuality(quPhiX174))
 > head(x)
  A QualityScaledDNAStringSet instance containing:

  A DNAStringSet instance of length 6
    width seq
[1]    35 GTTATTATACCGTCAAGGACTGTGTGACTATTGAC
[2]    35 GGTGGTTATTATACCGTCAAGGACTGTGTGACTAT
[3]    35 TACCGTCAAGGACTGTGTGACTATTGACGTCCTTC
[4]    35 GTACGCCGGGCAATAATGTTTATGTTGGTTTCATG
[5]    35 GGTTTCATGGTTTGGTCTAACTTTACCGCTACTAA
[6]    35 GGGCAATAATGTTTATGTTGGTTTCATGGTTTGGT

  A SolexaQuality instance of length 6
    width seq
[1]    35 ZYZZZZZZZZZYYZZYYYYYYYYYYYYYYYYYQYY
[2]    35 ZZYZZYZZZZYYYYYYYYYYYYYYYYYYYVYYYTY
[3]    35 ZZZYZYYZYYZYYZYYYYYYYYYYYYYYVYYYYYY
[4]    35 ZZYZZZZZZZZZYZTYYYYYYYYYYYYYYYYYNYT
[5]    35 ZZZZZZYZYYZZZYYYYYYYYYYYYYYYYYSYYSY
[6]    35 ZZZZZYZYYYZYYZYYYYYYYYYYYSYQVYYYASY

 > head(trimLRPatterns(Lpattern = "GT", Rpattern = "AT", subject = x))
  A QualityScaledDNAStringSet instance containing:

  A DNAStringSet instance of length 6
    width seq
[1]    33 TATTATACCGTCAAGGACTGTGTGACTATTGAC
[2]    33 GGTGGTTATTATACCGTCAAGGACTGTGTGACT
[3]    35 TACCGTCAAGGACTGTGTGACTATTGACGTCCTTC
[4]    33 ACGCCGGGCAATAATGTTTATGTTGGTTTCATG
[5]    35 GGTTTCATGGTTTGGTCTAACTTTACCGCTACTAA
[6]    35 GGGCAATAATGTTTATGTTGGTTTCATGGTTTGGT

  A SolexaQuality instance of length 6
    width seq
[1]    33 ZZZZZZZZZYYZZYYYYYYYYYYYYYYYYYQYY
[2]    33 ZZYZZYZZZZYYYYYYYYYYYYYYYYYYYVYYY
[3]    35 ZZZYZYYZYYZYYZYYYYYYYYYYYYYYVYYYYYY
[4]    33 YZZZZZZZZZYZTYYYYYYYYYYYYYYYYYNYT
[5]    35 ZZZZZZYZYYZZZYYYYYYYYYYYYYYYYYSYYSY
[6]    35 ZZZZZYZYYYZYYZYYYYYYYYYYYSYQVYYYASY

 > sessionInfo()
R version 2.10.0 Under development (unstable) (2009-05-08 r48504)
i386-apple-darwin9.6.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base    

other attached packages:
[1] Biostrings_2.13.8 IRanges_1.3.12  

loaded via a namespace (and not attached):
[1] Biobase_2.5.2 tools_2.10.0


Hervé Pagès wrote:
> Thomas,
>
> Thomas Girke wrote:
>> Dear Herve,
>>
>> Thanks a lot for your suggestions and efforts to optimize this step!
>>
>> We are often using sequences of variable length (e.g. after adaptor 
>> trimming), but this shouldn't be a problem because we can simply loop 
>> over every length interval (usually less than 100 iterations!).
>
> Yes I could extend replaceLetterAt() so it supports DNAStringSet objects
> of arbitrary shape. In that case, two formats for 'at' would be natural:
> "list of logical vectors" or "list of integer vectors". Then you could do
> something like this:
>
> (1) Trim the adaptor:
>
>       trimming <- trimLRPatterns(..., dset, ranges=TRUE)
>       trimmed_dset <- narrow(dset, start=start(trimming), 
> end=end(trimming))
>       trimmed_myqual <- narrow(myqual,
>                                start=start(trimming),
>                                end=end(trimming))
>
> (2) Build the 'at' object to be used in replaceLetterAt():
>
>       flat_quals <- charToRaw(as.character(unlist(trimmed_myqual)))
>       flat_at <- flat_quals < charToRaw(as.character(PhredQuality(20L)))
>       at <- split(flat_at,
>                   rep.int(seq_len(length(trimmed_myqual)),
>                   width(trimmed_myqual)))
>
> (3) Build the 'letter' object to be used in replaceLetterAt():
>
>       letter_subject <-
>           DNAString(paste(rep.int("N", max(width(trimmed_dset))), 
> collapse=""))
>       letter <- as(Views(letter_subject, start=1, end=sapply(at, sum)),
>                    "DNAStringSet")
>
> (4) Call replaceLetterAt():
>
>       dset4 <- replaceLetterAt(trimmed_dset, at, letter)
>
> Also some convenience could be added to Biostrings to support this use 
> case
> by wrapping some of the steps above into utility functions e.g.
>  o (1) could be done in a single call if we had a "trimLRPatterns"
>    method for QualityScaledDNAStringSet objects (which is easy
>    to add),
>  o (2) and (3) could be wrapped too but I don't have good names for
>    the wrappers and am not sure about their arguments either. I would
>    need to think a little bit about it.
>
> What do you think? Would that work?
>
>>
>> One short question:
>> Are there any plans to support masks in the future on XStringSet
>> objects? If one could define and apply masks to sequences of variable
>> length in XStringSet objects (without loops) then this would be be very
>> useful for many applications.
>
> No plans to support *soft* masking of XStringSet objects in the near
> future because it's a too complicated thing to put in place, at least
> for now. However *hard* masking (i.e. alteration of an XStringSet object
> by injection of the "-" letter at some locations) is much easier to
> support because 99% of the times you don't need to touch the 
> implementation
> of the functions that operate on XStringSet objects to make them work
> on a hard-masked XStringSet object. They just keep working just fine.
> The downside of hard masking is that every time the hard mask needs to be
> modified, you need to make a new copy of the XStringSet object. But we
> first need to see use cases where this becomes really an issue before
> we start working on *soft* masking of XStringSet objects.
>
> Cheers,
> H.
>
>
>>
>> Many thanks again,
>> Thomas
>>
>> On Tue, May 19, 2009 at 11:17:16AM -0700, Hervé Pagès wrote:
>>> Hi Thomas,
>>>
>>> Sorry for the slow response on this.
>>>
>>> Here is a much faster solution to your problem. Depending on the
>>> size of the 'dset' and 'myqual' rectangles, it will be 100x or 1000x
>>> faster than the sapply-based solution. Let's assume 'dset' is the
>>> constant width DNAStringSet object containing your reads and 'myqual'
>>> is the constant width PhredQuality object containing the qualities
>>> of your reads.
>>> The following solution takes advantage of the fact that 'dset' and
>>> therefore 'myqual' are rectangular, hence 'myqual' can be turned into
>>> a raw matrix pretty efficiently:
>>>
>>>   myqual_mat <- matrix(charToRaw(as.character(unlist(myqual))),
>>>                        nrow=length(myqual), byrow=TRUE)
>>>
>>> This raw matrix itself can be turned into a boolean matrix using some
>>> cutoff value:
>>>
>>>   at <- myqual_mat < charToRaw(as.character(PhredQuality(20L)))
>>>
>>> Also I've just added a "replaceLetterAt" method for DNAStringSet 
>>> objects
>>> to Biostrings (in addition to the existing method for DNAString 
>>> objects).
>>> Currently, it has the limitation that 'x' and 'at' must be rectangular
>>> i.e. 'x' must have a constant width and 'at' must be a logical matrix
>>> with the same dimensions as 'x'. That's of course just what's needed to
>>> handle your usecase:
>>>
>>>   letter_subject <- DNAString(paste(rep.int("N", width(dset)[1]),   
>>> collapse=""))
>>>   letter <- as(Views(letter_subject, start=1, end=rowSums(at)),   
>>> "DNAStringSet")
>>>   dset3 <- replaceLetterAt(dset, at, letter)
>>>
>>> 'dset3' is the same as the 'dset2' object you obtained with the 
>>> sapply-based
>>> solution.
>>>
>>> This improved replaceLetterAt() function will be available in 
>>> Biostrings
>>> release (v. 2.12.3) and devel (v. 2.13.8) when they propagate to the 
>>> public
>>> repositories (in the next 24 hours).
>>>
>>> Please let me know if you have any other concerns.
>>>
>>> Cheers,
>>> H.
>>>
>>>
>>> Thomas Girke wrote:
>>>> Dear Developers,
>>>>
>>>> Is there an efficient method available for replacing base calls in 
>>>> sequences by Ns (or any other character) at positions where the 
>>>> quality score drops below some arbitrary threshold. So far we have 
>>>> been using the following code for this purpose. In general this 
>>>> approach works fine. However, the R loop for applying this function 
>>>> makes it slow for very large data sets with millions of sequences.
>>>> Is there a method available for achieving the same result more 
>>>> efficiently? Considering the available resources in Biostrings, I 
>>>> am wondering if it wouldn't make a lot of sense of using 
>>>> Biostrings' sequence masking method along with the injectHardMask 
>>>> function for this purpose. However, at this point I am only aware 
>>>> of methods for applying masks efficiently to a single sequence in a 
>>>> XString object, but not to many sequences stored in a XStringSet 
>>>> object.
>>>> #################
>>>> ## Sample Code ##
>>>> #################
>>>> ## Create example data set with random data
>>>> library(Biostrings)
>>>> dset <- DNAStringSet(sapply(1:100, function(x) 
>>>> paste(sample(c("A","T","G","C"), 20, replace=T), collapse=""))) 
>>>> myqlist <- lapply(1:100, function(x) sample(1:40, 20, replace=T)) # 
>>>> Create random Phred score list.
>>>> myqual <- sapply(myqlist, function(x) toString(PhredQuality(x))) 
>>>> myqual <- PhredQuality(myqual) dsetq1 <- 
>>>> QualityScaledDNAStringSet(dset, myqual)
>>>> ## Define in character inject function XStringSetInject     
>>>> XStringSetInject <- function(myseq=dset[[1]], myqual=myqual[1], 
>>>> cutoff=20, mychar="N") { mypos <- which(as.integer(myqual)<cutoff)
>>>>        return(toString(replaceLetterAt(myseq,        
>>>> which(as.integer(myqual)<cutoff), rep(mychar, length(mypos)))))
>>>> }
>>>> ## Apply XStringSetInject function to all sequences.
>>>> dvec <- sapply(1:length(dset), function(x) 
>>>> XStringSetInject(myseq=dset[[x]], myqual=myqual[x], cutoff=20))
>>>> dset2 <- DNAStringSet(dvec)
>>>> names(dset2) <- paste("d", 1:length(dvec), sep="")
>>>> names(myqual) <- paste("d", 1:length(dvec), sep="")
>>>> dsetq2 <- QualityScaledDNAStringSet(dset2, myqual)
>>>> ## Compare dsetq1 and dsetq2
>>>> dsetq1[1:2] dsetq2[1:2]
>>>>
>>>>> sessionInfo()
>>>> R version 2.9.0 (2009-04-17) x86_64-unknown-linux-gnu
>>>> locale:
>>>> LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C 
>>>>
>>>>
>>>> attached base packages:
>>>> [1] stats     graphics  grDevices utils     datasets  methods   
>>>> base    
>>>> other attached packages:
>>>> [1] Biostrings_2.12.1 IRanges_1.2.0   
>>>> loaded via a namespace (and not attached):
>>>> [1] Biobase_2.4.0
>>>>
>>>>
>>>> Thanks in advance,
>>>>
>>>> Thomas
>>>>
>>>> _______________________________________________
>>>> Bioc-sig-sequencing mailing list
>>>> Bioc-sig-sequencing at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>>> -- 
>>> Hervé Pagès
>>>
>>> Program in Computational Biology
>>> Division of Public Health Sciences
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N, M2-B876
>>> P.O. Box 19024
>>> Seattle, WA 98109-1024
>>>
>>> E-mail: hpages at fhcrc.org
>>> Phone:  (206) 667-5791
>>> Fax:    (206) 667-1319
>>>
>