[Bioc-devel] Subject: GRanges performance issue, how to avoid looping?

Hervé Pagès hpages at fhcrc.org
Thu Feb 23 00:50:18 CET 2012


Hi Jesper,

On 02/22/2012 04:15 AM, Jesper Gådin wrote:
> Hi everyone!
> I want to find all reads that map to a specific position in a GRanges
> object.
> In my case this object is named "reads" and I have tried using the function
> below.
>
>
> #From a given position and chromosome, find the reads where this position
> map
> fun_find_reads_from_pos<- function(reads,chr,someposition, verbose=TRUE) {
>
>      #get length over the sample list
>      nr_samples<-length(reads)
>
>      for (j in (1:nr_samples)) {
>          onePerson<- reads[[j]]
>          chr_onePerson<- onePerson[seqnames(onePerson)==chr]
>          pre_seq<-
> chr_onePerson[(ranges(chr_onePerson))%in%(IRanges(start=someposition,
> width=1)),6]
>
>          if(!length(pre_seq)==0) {
>              for (k in (1:length(pre_seq))) {
>                  position<- (someposition-start(pre_seq[k])) +1
> #57537220
>                  print("found the given someposition on this position within
> the read")
>                  print(position)
>              }
>          }
>      }
>
> } #End of function
>
>
> To use the function try:
> fun_find_reads_from_pos(reads,"chr12",57537220)

We can't run this function because we don't know what 'reads' is.
It looks like a list-like object (because you are using [[ on it),
where each element could be a GRanges or GappedAlignments object
(because you are using seqnames() and ranges() on those elements).
Given the subject of your email, I'll assume it's a GRangesList
object or a list of GRanges objects.

>
> Now that should work.
> And everything would be fine if it wasnt for the time issue.
> The time-thief I guess is that I have to loop over every read
> in every sample to get to the information I want.

You are using 2 nested loops.
The outer loop is on the samples and I would expect the nb of samples
to be relatively small, so this loop is probably not an issue.
The inner loop is on the genomic ranges that actually overlap with
the specified position. It is probably expensive but you shouldn't
do that loop. Instead you should take advantage of the fact that
arithmetic is vectorized in R. So instead of:

         if(!length(pre_seq)==0) {
             for (k in (1:length(pre_seq))) {
                 position <- (someposition-start(pre_seq[k])) +1
                 print("found the given someposition on this position 
within the read")
                 print(position)
             }
         }

do something like:

         if (length(pre_seq) != 0L) {
             position <- someposition - start(pre_seq) + 1L
             cat("found the given someposition on those positions within 
the read:\n")
             print(position)
         }

Now it's not clear to me why you want to print this but that's another
story.

Cheers,
H.

> Is it
> possible to use the data structure in another better way to
> avoid unnecessary looping. Anyone have an idea?
>
> This function would be the core function of my future package.
> So its important that it is effective.
>
> Sincerely,
> Jesper
>
> Link to the reads object needed to run the function:
> http://uppsalanf.se/sites/default/files/reads_object.RData
> And the function itself (same as above):
> http://uppsalanf.se/sites/default/files/example-bioc-dev.R
>
>
>> sessionInfo()
> R version 2.13.2 (2011-09-30)
> Platform: x86_64-pc-linux-gnu (64-bit)
>
> locale:
>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>   [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>   [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] Biostrings_2.20.4   GenomicRanges_1.4.8 IRanges_1.10.6
>> ?GRanges
>> sessionInfo()
> R version 2.13.2 (2011-09-30)
> Platform: x86_64-pc-linux-gnu (64-bit)
>
> locale:
>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>   [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>   [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] Biostrings_2.20.4   GenomicRanges_1.4.8 IRanges_1.10.6
>
> loaded via a namespace (and not attached):
> [1] tools_2.13.2
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel


-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioc-devel mailing list