[R] how to work with long vectors

Fri Nov 5 18:15:05 CET 2010

There was a numerical typo below, I said
the sample sizes were 5 and 10 thousand,
I should have said 10 and 20 thousand
(the size argument to sample()).

Also, I timed cover_per_2 and _3 for size
200,000 and gots times of 338 and 0.12 seconds,
respectively.  Growing the problem by a factor
to 10 made cover_per_2 used 100 times more time
and cover_per_3 c. 10 times more (the times are
too small to get an accurate ratio).

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com  

> -----Original Message-----
> From: r-help-bounces at r-project.org 
> [mailto:r-help-bounces at r-project.org] On Behalf Of William Dunlap
> Sent: Friday, November 05, 2010 9:58 AM
> To: Changbin Du
> Cc: r-help at r-project.org
> Subject: Re: [R] how to work with long vectors
> 
> The following cover_per_3 uses sorting to solve
> the problem more quickly.  It still has room
> for improvement.
> 
> cover_per_3 <- function (data) 
> {
>     n <- length(data)
>     o <- rev(order(data))
>     sdata <- data[o]
>     r <- rle(sdata)$lengths
>     output <- numeric(n)
>     output[o] <- rep(cumsum(r), r)
>     100 * output/n
> }
> 
> (The ecdf function would probabably also do the
> job quickly.)
> 
> When trying to work on problems like this I find
> it most fruitful to work on smaller datasets and
> see how the time grows with the size of the data,
> instead of seeing how many days a it takes on a huge
> dataset.  E.g., the following compares times for
> your original function, Phil Spector's simple cleanup
> of your function, and the sort based approach for
> vectors of length 5 and 10 thousand.
> 
> > z<-sample(5e3, size=1e4, replace=TRUE) ; print(system.time(v <-
> cover_per(z))) ; print(system.time(v_2 <- cover_per_2(z))) ;
> print(system.time(v_3 <- cover_per_3(z)))
>    user  system elapsed 
>   38.21    0.00   38.41 
>    user  system elapsed 
>    0.86    0.00    0.86 
>    user  system elapsed 
>       0       0       0 
> > identical(v_3,v)
> [1] TRUE
> > z<-sample(1e4, size=2e4, replace=TRUE) ; print(system.time(v <-
> cover_per(z))) ; print(system.time(v_2 <- cover_per_2(z))) ;
> print(system.time(v_3 <- cover_per_3(z)))
>    user  system elapsed 
>  158.48    0.07  159.31 
>    user  system elapsed 
>    3.23    0.00    3.25 
>    user  system elapsed 
>    0.02    0.00    0.02 
> > identical(v_3,v)
> [1] TRUE
> 
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com  
> 
> > -----Original Message-----
> > From: r-help-bounces at r-project.org 
> > [mailto:r-help-bounces at r-project.org] On Behalf Of Changbin Du
> > Sent: Friday, November 05, 2010 9:14 AM
> > To: Phil Spector
> > Cc: <r-help at r-project.org>
> > Subject: Re: [R] how to work with long vectors
> > 
> > HI, Phil,
> > 
> > I used the following codes and run it overnight for 15 hours, 
> > this morning,
> > I stopped it. It seems it is still not efficient.
> > 
> > 
> > >
> > matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILL
> UMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GW>
> ZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
> > sep="\t", skip=0, header=F,fill=T) #
> > > names(matt)<-c("id","reads")
> > 
> > > dim(matt)
> > [1] 3384766       2
> > 
> > >  cover<-matt$reads
> > 
> > > cover_per_2 <- function(data){
> > +   l = length(data)
> > +   output = numeric(l)
> > +   for(i in 1:l)output[i] = sum(data >= data[i])
> > +   100 * output / l
> > + }
> > 
> > > result3<-cover_per_2(cover)
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > On Thu, Nov 4, 2010 at 10:37 AM, Changbin Du 
> > <changbind at gmail.com> wrote:
> > 
> > > Thanks Phil, that is great! I WILL try this and let you 
> > know how it goes.
> > >
> > >
> > >
> > >
> > > On Thu, Nov 4, 2010 at 10:16 AM, Phil Spector 
> > <spector at stat.berkeley.edu>wrote:
> > >
> > >> Changbin -
> > >>   Does
> > >>
> > >>    100 * sapply(matt$reads,function(x)sum(matt$reads >=
> > >> x))/length(matt$reads)
> > >>
> > >> give what you want?
> > >>
> > >>    By the way, if you want to use a loop (there's nothing 
> > wrong with
> > >> that),
> > >> then try to avoid the most common mistake that people make 
> > with loops in
> > >> R:
> > >> having your result grow inside the loop.  Here's a better 
> > way to use a
> > >> loop
> > >> to solve your problem:
> > >>
> > >> cover_per_1 <- function(data){
> > >>   l = length(data)
> > >>   output = numeric(l)
> > >>   for(i in 1:l)output[i] = 100 * sum(ifelse(data >= data[i], 1,
> > >> 0))/length(data)
> > >>   output
> > >> }
> > >>
> > >> Using some random data, and comparing to your original 
> > cover_per function:
> > >>
> > >>  dat = rnorm(1000)
> > >>> system.time(one <- cover_per(dat))
> > >>>
> > >>   user  system elapsed
> > >>  0.816   0.000   0.824
> > >>
> > >>> system.time(two <- cover_per_1(dat))
> > >>>
> > >>   user  system elapsed
> > >>  0.792   0.000   0.805
> > >>
> > >> Not that big a speedup, but it does increase quite a bit 
> > as the problem
> > >> gets
> > >> larger.
> > >>
> > >> There are two obvious ways to speed up your function:
> > >>   1)  Eliminate the ifelse function, since automatic 
> coersion from
> > >>       logical to numeric does the same thing.
> > >>   2)  Multiply by 100 and divide by the length outside the loop:
> > >>
> > >> cover_per_2 <- function(data){
> > >>   l = length(data)
> > >>   output = numeric(l)
> > >>   for(i in 1:l)output[i] = sum(data >= data[i])
> > >>   100 * output / l
> > >> }
> > >>
> > >>  system.time(three <- cover_per_2(dat))
> > >>>
> > >>   user  system elapsed
> > >>  0.024   0.000   0.027
> > >>
> > >> That makes the loop just about equivalent to the sapply solution:
> > >>
> > >>  system.time(four <- 100*sapply(dat,function(x)sum(dat >= 
> > x))/length(dat))
> > >>>
> > >>   user  system elapsed
> > >>  0.024   0.000   0.026
> > >>
> > >>                                        - Phil Spector
> > >>                                         Statistical 
> > Computing Facility
> > >>                                         Department of Statistics
> > >>                                         UC Berkeley
> > >>                                         spector at stat.berkeley.edu
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> On Thu, 4 Nov 2010, Changbin Du wrote:
> > >>
> > >>  HI, Dear R community,
> > >>>
> > >>> I have one data set like this,  What I want to do is to 
> > calculate the
> > >>> cumulative coverage. The following codes works for small 
> > data set (#rows
> > >>> =
> > >>> 100), but when feed the whole data set,  it still running 
> > after 24 hours.
> > >>> Can someone give some suggestions for long vector?
> > >>>
> > >>> id                reads
> > >>> Contig79:1    4
> > >>> Contig79:2    8
> > >>> Contig79:3    13
> > >>> Contig79:4    14
> > >>> Contig79:5    17
> > >>> Contig79:6    20
> > >>> Contig79:7    25
> > >>> Contig79:8    27
> > >>> Contig79:9    32
> > >>> Contig79:10    33
> > >>> Contig79:11    34
> > >>>
> > >>>
> > >>> 
> > matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILL
> UMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GW>
> ZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
> > >>> sep="\t", skip=0, header=F,fill=T) #
> > >>> dim(matt)
> > >>> [1] 3384766       2
> > >>>
> > >>> matt_plot<-function(matt, outputfile) {
> > >>> names(matt)<-c("id","reads")
> > >>>
> > >>> cover<-matt$reads
> > >>>
> > >>>
> > >>> #calculate the cumulative coverage.
> > >>> + cover_per<-function (data) {
> > >>> + output<-numeric(0)
> > >>> + for (i in data) {
> > >>> +           x<-(100*sum(ifelse(data >= i, 1, 0))/length(data))
> > >>> +           output<-c(output, x)
> > >>> +                 }
> > >>> + return(output)
> > >>> + }
> > >>>
> > >>>
> > >>> result<-cover_per(cover)
> > >>>
> > >>>
> > >>> Thanks so much!
> > >>>
> > >>>
> > >>> --
> > >>> Sincerely,
> > >>> Changbin
> > >>> --
> > >>>
> > >>>        [[alternative HTML version deleted]]
> > >>>
> > >>> ______________________________________________
> > >>> R-help at r-project.org mailing list
> > >>> https://stat.ethz.ch/mailman/listinfo/r-help
> > >>> PLEASE do read the posting guide
> > >>> http://www.R-project.org/posting-guide.html
> > >>> and provide commented, minimal, self-contained, 
> reproducible code.
> > >>>
> > >>>
> > >
> > >
> > > --
> > > Sincerely,
> > > Changbin
> > > --
> > >
> > > Changbin Du
> > > DOE Joint Genome Institute
> > > Bldg 400 Rm 457
> > > 2800 Mitchell Dr
> > > Walnut Creet, CA 94598
> > > Phone: 925-927-2856
> > >
> > >
> > >
> > 
> > 
> > -- 
> > Sincerely,
> > Changbin
> > --
> > 
> > Changbin Du
> > DOE Joint Genome Institute
> > Bldg 400 Rm 457
> > 2800 Mitchell Dr
> > Walnut Creet, CA 94598
> > Phone: 925-927-2856
> > 
> > 	[[alternative HTML version deleted]]
> > 
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide 
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> > 
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide 
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>