[Bioc-sig-seq] Fast inject into XStringSet objects

Wed May 20 01:06:54 CEST 2009

Thomas,

Thomas Girke wrote:
> Dear Herve,
> 
> Thanks a lot for your suggestions and efforts to optimize this step!
> 
> We are often using sequences of variable length (e.g. after adaptor 
> trimming), but this shouldn't be a problem because we can simply loop 
> over every length interval (usually less than 100 iterations!).

Yes I could extend replaceLetterAt() so it supports DNAStringSet objects
of arbitrary shape. In that case, two formats for 'at' would be natural:
"list of logical vectors" or "list of integer vectors". Then you could do
something like this:

(1) Trim the adaptor:

       trimming <- trimLRPatterns(..., dset, ranges=TRUE)
       trimmed_dset <- narrow(dset, start=start(trimming), end=end(trimming))
       trimmed_myqual <- narrow(myqual,
                                start=start(trimming),
                                end=end(trimming))

(2) Build the 'at' object to be used in replaceLetterAt():

       flat_quals <- charToRaw(as.character(unlist(trimmed_myqual)))
       flat_at <- flat_quals < charToRaw(as.character(PhredQuality(20L)))
       at <- split(flat_at,
                   rep.int(seq_len(length(trimmed_myqual)),
                   width(trimmed_myqual)))

(3) Build the 'letter' object to be used in replaceLetterAt():

       letter_subject <-
           DNAString(paste(rep.int("N", max(width(trimmed_dset))), collapse=""))
       letter <- as(Views(letter_subject, start=1, end=sapply(at, sum)),
                    "DNAStringSet")

(4) Call replaceLetterAt():

       dset4 <- replaceLetterAt(trimmed_dset, at, letter)

Also some convenience could be added to Biostrings to support this use case
by wrapping some of the steps above into utility functions e.g.
  o (1) could be done in a single call if we had a "trimLRPatterns"
    method for QualityScaledDNAStringSet objects (which is easy
    to add),
  o (2) and (3) could be wrapped too but I don't have good names for
    the wrappers and am not sure about their arguments either. I would
    need to think a little bit about it.

What do you think? Would that work?

> 
> One short question:
> Are there any plans to support masks in the future on XStringSet
> objects? If one could define and apply masks to sequences of variable
> length in XStringSet objects (without loops) then this would be be very
> useful for many applications.

No plans to support *soft* masking of XStringSet objects in the near
future because it's a too complicated thing to put in place, at least
for now. However *hard* masking (i.e. alteration of an XStringSet object
by injection of the "-" letter at some locations) is much easier to
support because 99% of the times you don't need to touch the implementation
of the functions that operate on XStringSet objects to make them work
on a hard-masked XStringSet object. They just keep working just fine.
The downside of hard masking is that every time the hard mask needs to be
modified, you need to make a new copy of the XStringSet object. But we
first need to see use cases where this becomes really an issue before
we start working on *soft* masking of XStringSet objects.

Cheers,
H.

> 
> Many thanks again, 
> 
> Thomas
> 
> On Tue, May 19, 2009 at 11:17:16AM -0700, Hervé Pagès wrote:
>> Hi Thomas,
>>
>> Sorry for the slow response on this.
>>
>> Here is a much faster solution to your problem. Depending on the
>> size of the 'dset' and 'myqual' rectangles, it will be 100x or 1000x
>> faster than the sapply-based solution. Let's assume 'dset' is the
>> constant width DNAStringSet object containing your reads and 'myqual'
>> is the constant width PhredQuality object containing the qualities
>> of your reads.
>> The following solution takes advantage of the fact that 'dset' and
>> therefore 'myqual' are rectangular, hence 'myqual' can be turned into
>> a raw matrix pretty efficiently:
>>
>>   myqual_mat <- matrix(charToRaw(as.character(unlist(myqual))),
>>                        nrow=length(myqual), byrow=TRUE)
>>
>> This raw matrix itself can be turned into a boolean matrix using some
>> cutoff value:
>>
>>   at <- myqual_mat < charToRaw(as.character(PhredQuality(20L)))
>>
>> Also I've just added a "replaceLetterAt" method for DNAStringSet objects
>> to Biostrings (in addition to the existing method for DNAString objects).
>> Currently, it has the limitation that 'x' and 'at' must be rectangular
>> i.e. 'x' must have a constant width and 'at' must be a logical matrix
>> with the same dimensions as 'x'. That's of course just what's needed to
>> handle your usecase:
>>
>>   letter_subject <- DNAString(paste(rep.int("N", width(dset)[1]), 
>>   collapse=""))
>>   letter <- as(Views(letter_subject, start=1, end=rowSums(at)), 
>>   "DNAStringSet")
>>   dset3 <- replaceLetterAt(dset, at, letter)
>>
>> 'dset3' is the same as the 'dset2' object you obtained with the sapply-based
>> solution.
>>
>> This improved replaceLetterAt() function will be available in Biostrings
>> release (v. 2.12.3) and devel (v. 2.13.8) when they propagate to the public
>> repositories (in the next 24 hours).
>>
>> Please let me know if you have any other concerns.
>>
>> Cheers,
>> H.
>>
>>
>> Thomas Girke wrote:
>>> Dear Developers,
>>>
>>> Is there an efficient method available for replacing base calls in 
>>> sequences by Ns (or any other character) at positions where the quality 
>>> score drops below some arbitrary threshold. So far we have been using 
>>> the following code for this purpose. In general this approach works fine. 
>>> However, the R loop for applying this function makes it slow for very 
>>> large data sets with millions of sequences. 
>>>
>>> Is there a method available for achieving the same result more 
>>> efficiently? 
>>> Considering the available resources in Biostrings, I am wondering if it 
>>> wouldn't make a lot of sense of using Biostrings' sequence masking method 
>>> along with the injectHardMask function for this purpose. However, at this 
>>> point I am only aware of methods for applying masks efficiently to a 
>>> single sequence in a XString object, but not to many sequences stored in a 
>>> XStringSet object. 
>>>
>>> #################
>>> ## Sample Code ##
>>> #################
>>> ## Create example data set with random data
>>> library(Biostrings)
>>> dset <- DNAStringSet(sapply(1:100, function(x) 
>>> paste(sample(c("A","T","G","C"), 20, replace=T), collapse=""))) myqlist <- 
>>> lapply(1:100, function(x) sample(1:40, 20, replace=T)) # Create random 
>>> Phred score list.
>>> myqual <- sapply(myqlist, function(x) toString(PhredQuality(x))) 
>>> myqual <- PhredQuality(myqual) 
>>> dsetq1 <- QualityScaledDNAStringSet(dset, myqual) 
>>>
>>> ## Define in character inject function XStringSetInject     
>>> XStringSetInject <- function(myseq=dset[[1]], myqual=myqual[1], cutoff=20, 
>>> mychar="N") { mypos <- which(as.integer(myqual)<cutoff)
>>>        return(toString(replaceLetterAt(myseq, 
>>>        which(as.integer(myqual)<cutoff), rep(mychar, length(mypos)))))
>>> }
>>> ## Apply XStringSetInject function to all sequences.
>>> dvec <- sapply(1:length(dset), function(x) 
>>> XStringSetInject(myseq=dset[[x]], myqual=myqual[x], cutoff=20))
>>> dset2 <- DNAStringSet(dvec)
>>> names(dset2) <- paste("d", 1:length(dvec), sep="")
>>> names(myqual) <- paste("d", 1:length(dvec), sep="")
>>> dsetq2 <- QualityScaledDNAStringSet(dset2, myqual) 
>>>
>>> ## Compare dsetq1 and dsetq2
>>> dsetq1[1:2] 
>>> dsetq2[1:2] 
>>>
>>>
>>>> sessionInfo()
>>> R version 2.9.0 (2009-04-17) 
>>> x86_64-unknown-linux-gnu 
>>>
>>> locale:
>>> LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets  methods   base     
>>>
>>> other attached packages:
>>> [1] Biostrings_2.12.1 IRanges_1.2.0    
>>>
>>> loaded via a namespace (and not attached):
>>> [1] Biobase_2.4.0
>>>
>>>
>>> Thanks in advance,
>>>
>>> Thomas
>>>
>>> _______________________________________________
>>> Bioc-sig-sequencing mailing list
>>> Bioc-sig-sequencing at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioc-sig-sequencing
>> -- 
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M2-B876
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fhcrc.org
>> Phone:  (206) 667-5791
>> Fax:    (206) 667-1319
>>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319