[R] how to work with long vectors
William Dunlap
wdunlap at tibco.com
Fri Nov 5 17:58:04 CET 2010
The following cover_per_3 uses sorting to solve
the problem more quickly. It still has room
for improvement (one possibility is sketched after the function).
cover_per_3 <- function (data)
{
    n <- length(data)
    o <- rev(order(data))      # indices that put 'data' in decreasing order
    sdata <- data[o]           # data sorted from largest to smallest
    r <- rle(sdata)$lengths    # run lengths of tied values
    output <- numeric(n)
    # cumsum(r)[k] counts the elements >= the k-th largest distinct value;
    # rep() spreads that count across ties, and [o] maps it back to the
    # original positions
    output[o] <- rep(cumsum(r), r)
    100 * output/n
}
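(One further tightening, as a sketch: the count of elements >= data[i]
is n + 1 - rank(data, ties.method = "min"), so the whole computation
collapses to one call. cover_per_4 is an illustrative name, not from
this thread.)

cover_per_4 <- function (data)
{
    n <- length(data)
    # n + 1 - (min-rank of data[i]) = number of elements >= data[i]
    100 * (n + 1 - rank(data, ties.method = "min")) / n
}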
(The ecdf function would probably also do the
job quickly.)
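A minimal sketch of that idea (the name cover_per_ecdf is mine):
ecdf(-data) evaluated at -x gives mean(-data <= -x), i.e.
mean(data >= x), which is exactly the per-element proportion the
original cover_per computes.

cover_per_ecdf <- function (data)
{
    # proportion of elements >= each data value, via the empirical CDF
    100 * ecdf(-data)(-data)
}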
When trying to work on problems like this I find
it most fruitful to work on smaller datasets and
see how the time grows with the size of the data,
instead of seeing how many days it takes on a huge
dataset. E.g., the following compares times for
your original function, Phil Spector's simple cleanup
of your function, and the sort-based approach for
vectors of length 10 and 20 thousand.
> z <- sample(5e3, size=1e4, replace=TRUE)
> print(system.time(v <- cover_per(z)))
   user  system elapsed
  38.21    0.00   38.41
> print(system.time(v_2 <- cover_per_2(z)))
   user  system elapsed
   0.86    0.00    0.86
> print(system.time(v_3 <- cover_per_3(z)))
   user  system elapsed
      0       0       0
> identical(v_3, v)
[1] TRUE
> z <- sample(1e4, size=2e4, replace=TRUE)
> print(system.time(v <- cover_per(z)))
   user  system elapsed
 158.48    0.07  159.31
> print(system.time(v_2 <- cover_per_2(z)))
   user  system elapsed
   3.23    0.00    3.25
> print(system.time(v_3 <- cover_per_3(z)))
   user  system elapsed
   0.02    0.00    0.02
> identical(v_3, v)
[1] TRUE
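A compact way to script that kind of scaling check (a sketch; the
sizes are arbitrary):

for (n in c(1e4, 2e4, 4e4)) {
    z <- sample(n/2, size = n, replace = TRUE)
    cat("n =", n, "elapsed =",
        system.time(v_3 <- cover_per_3(z))["elapsed"], "\n")
}

If the elapsed time roughly quadruples each time n doubles, the
algorithm is quadratic; the sort-based version should grow only a
little faster than linearly.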
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
> -----Original Message-----
> From: r-help-bounces at r-project.org
> [mailto:r-help-bounces at r-project.org] On Behalf Of Changbin Du
> Sent: Friday, November 05, 2010 9:14 AM
> To: Phil Spector
> Cc: <r-help at r-project.org>
> Subject: Re: [R] how to work with long vectors
>
> Hi, Phil,
>
> I used the following code and ran it overnight for 15 hours;
> this morning I stopped it. It seems it is still not efficient.
>
>
> >
> > matt <- read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
> +                    sep="\t", skip=0, header=F, fill=T)
> > names(matt)<-c("id","reads")
>
> > dim(matt)
> [1] 3384766 2
>
> > cover<-matt$reads
>
> > cover_per_2 <- function(data){
> + l = length(data)
> + output = numeric(l)
> + for(i in 1:l)output[i] = sum(data >= data[i])
> + 100 * output / l
> + }
>
> > result3<-cover_per_2(cover)
>
> On Thu, Nov 4, 2010 at 10:37 AM, Changbin Du
> <changbind at gmail.com> wrote:
>
> > Thanks Phil, that is great! I will try this and let you know how
> > it goes.
> >
> > On Thu, Nov 4, 2010 at 10:16 AM, Phil Spector
> > <spector at stat.berkeley.edu> wrote:
> >
> >> Changbin -
> >> Does
> >>
> >> 100 * sapply(matt$reads, function(x) sum(matt$reads >= x)) /
> >>     length(matt$reads)
> >>
> >> give what you want?
> >>
> >> By the way, if you want to use a loop (there's nothing wrong with
> >> that), then try to avoid the most common mistake that people make
> >> with loops in R: having your result grow inside the loop. Here's a
> >> better way to use a loop to solve your problem:
> >>
> >> cover_per_1 <- function(data){
> >>   l = length(data)
> >>   output = numeric(l)    # preallocate the result once
> >>   for(i in 1:l)
> >>     output[i] = 100 * sum(ifelse(data >= data[i], 1, 0))/length(data)
> >>   output
> >> }
> >>
> >> Using some random data, and comparing to your original cover_per
> >> function:
> >>
> >> dat = rnorm(1000)
> >> > system.time(one <- cover_per(dat))
> >>    user  system elapsed
> >>   0.816   0.000   0.824
> >>
> >> > system.time(two <- cover_per_1(dat))
> >>    user  system elapsed
> >>   0.792   0.000   0.805
> >>
> >> Not that big a speedup, but it does increase quite a bit as the
> >> problem gets larger.
> >>
> >> There are two obvious ways to speed up your function:
> >> 1) Eliminate the ifelse function, since automatic coercion from
> >> logical to numeric does the same thing (a short illustration
> >> follows the timing below).
> >> 2) Multiply by 100 and divide by the length outside the loop:
> >>
> >> cover_per_2 <- function(data){
> >>   l = length(data)
> >>   output = numeric(l)
> >>   for(i in 1:l)
> >>     output[i] = sum(data >= data[i])   # logicals coerce to 0/1
> >>   100 * output / l                     # scale once, outside the loop
> >> }
> >>
> >> > system.time(three <- cover_per_2(dat))
> >>    user  system elapsed
> >>   0.024   0.000   0.027
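> >> (The illustration promised for point 1, as a minimal sketch:
> >>
> >>   sum(ifelse(c(2, 5, 3) >= 3, 1, 0))   # 2
> >>   sum(c(2, 5, 3) >= 3)                 # also 2; TRUE counts as 1
> >>
> >> so dropping ifelse changes nothing but the amount of work done.)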
> >>
> >> That makes the loop just about equivalent to the sapply solution:
> >>
> >> > system.time(four <- 100*sapply(dat, function(x) sum(dat >= x))/length(dat))
> >>    user  system elapsed
> >>   0.024   0.000   0.026
> >>
> >> - Phil Spector
> >>   Statistical Computing Facility
> >>   Department of Statistics
> >>   UC Berkeley
> >>   spector at stat.berkeley.edu
> >>
> >> On Thu, 4 Nov 2010, Changbin Du wrote:
> >>
> >>> Hi, Dear R community,
> >>>
> >>> I have one data set like this. What I want to do is to calculate
> >>> the cumulative coverage. The following code works for a small data
> >>> set (#rows = 100), but when fed the whole data set, it was still
> >>> running after 24 hours. Can someone give some suggestions for long
> >>> vectors?
> >>>
> >>> id reads
> >>> Contig79:1 4
> >>> Contig79:2 8
> >>> Contig79:3 13
> >>> Contig79:4 14
> >>> Contig79:5 17
> >>> Contig79:6 20
> >>> Contig79:7 25
> >>> Contig79:8 27
> >>> Contig79:9 32
> >>> Contig79:10 33
> >>> Contig79:11 34
> >>>
> >>> matt <- read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
> >>>                    sep="\t", skip=0, header=F, fill=T)
> >>> dim(matt)
> >>> [1] 3384766 2
> >>>
> >>> matt_plot <- function(matt, outputfile) {
> >>>   names(matt) <- c("id","reads")
> >>>
> >>>   cover <- matt$reads
> >>>
> >>>   # calculate the cumulative coverage
> >>>   cover_per <- function (data) {
> >>>     output <- numeric(0)
> >>>     for (i in data) {
> >>>       x <- (100*sum(ifelse(data >= i, 1, 0))/length(data))
> >>>       output <- c(output, x)   # output grows on every iteration
> >>>     }
> >>>     return(output)
> >>>   }
> >>>
> >>>   result <- cover_per(cover)
> >>>
> >>>
> >>> Thanks so much!
> >>>
> >>>
> >>> --
> >>> Sincerely,
> >>> Changbin
> >>> --
> >>>
> >>> [[alternative HTML version deleted]]
> >>>
> >>> ______________________________________________
> >>> R-help at r-project.org mailing list
> >>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>> PLEASE do read the posting guide
> >>> http://www.R-project.org/posting-guide.html
> >>> and provide commented, minimal, self-contained, reproducible code.
> >>>
> >>>
> >
> >
> > --
> > Sincerely,
> > Changbin
> > --
> >
> > Changbin Du
> > DOE Joint Genome Institute
> > Bldg 400 Rm 457
> > 2800 Mitchell Dr
> > Walnut Creek, CA 94598
> > Phone: 925-927-2856
> >
> >
> >
>
>
> --
> Sincerely,
> Changbin
> --
>
> Changbin Du
> DOE Joint Genome Institute
> Bldg 400 Rm 457
> 2800 Mitchell Dr
> Walnut Creek, CA 94598
> Phone: 925-927-2856
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>