[R] how to work with long vectors
William Dunlap
wdunlap at tibco.com
Fri Nov 5 18:15:05 CET 2010
There was a numerical typo below, I said
the sample sizes were 5 and 10 thousand,
I should have said 10 and 20 thousand
(the size argument to sample()).
Also, I timed cover_per_2 and _3 for size
200,000 and gots times of 338 and 0.12 seconds,
respectively. Growing the problem by a factor
to 10 made cover_per_2 used 100 times more time
and cover_per_3 c. 10 times more (the times are
too small to get an accurate ratio).
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
> -----Original Message-----
> From: r-help-bounces at r-project.org
> [mailto:r-help-bounces at r-project.org] On Behalf Of William Dunlap
> Sent: Friday, November 05, 2010 9:58 AM
> To: Changbin Du
> Cc: r-help at r-project.org
> Subject: Re: [R] how to work with long vectors
>
> The following cover_per_3 uses sorting to solve
> the problem more quickly. It still has room
> for improvement.
>
> cover_per_3 <- function (data)
> {
> n <- length(data)
> o <- rev(order(data))
> sdata <- data[o]
> r <- rle(sdata)$lengths
> output <- numeric(n)
> output[o] <- rep(cumsum(r), r)
> 100 * output/n
> }
>
> (The ecdf function would probabably also do the
> job quickly.)
>
> When trying to work on problems like this I find
> it most fruitful to work on smaller datasets and
> see how the time grows with the size of the data,
> instead of seeing how many days a it takes on a huge
> dataset. E.g., the following compares times for
> your original function, Phil Spector's simple cleanup
> of your function, and the sort based approach for
> vectors of length 5 and 10 thousand.
>
> > z<-sample(5e3, size=1e4, replace=TRUE) ; print(system.time(v <-
> cover_per(z))) ; print(system.time(v_2 <- cover_per_2(z))) ;
> print(system.time(v_3 <- cover_per_3(z)))
> user system elapsed
> 38.21 0.00 38.41
> user system elapsed
> 0.86 0.00 0.86
> user system elapsed
> 0 0 0
> > identical(v_3,v)
> [1] TRUE
> > z<-sample(1e4, size=2e4, replace=TRUE) ; print(system.time(v <-
> cover_per(z))) ; print(system.time(v_2 <- cover_per_2(z))) ;
> print(system.time(v_3 <- cover_per_3(z)))
> user system elapsed
> 158.48 0.07 159.31
> user system elapsed
> 3.23 0.00 3.25
> user system elapsed
> 0.02 0.00 0.02
> > identical(v_3,v)
> [1] TRUE
>
> Bill Dunlap
> Spotfire, TIBCO Software
> wdunlap tibco.com
>
> > -----Original Message-----
> > From: r-help-bounces at r-project.org
> > [mailto:r-help-bounces at r-project.org] On Behalf Of Changbin Du
> > Sent: Friday, November 05, 2010 9:14 AM
> > To: Phil Spector
> > Cc: <r-help at r-project.org>
> > Subject: Re: [R] how to work with long vectors
> >
> > HI, Phil,
> >
> > I used the following codes and run it overnight for 15 hours,
> > this morning,
> > I stopped it. It seems it is still not efficient.
> >
> >
> > >
> > matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILL
> UMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GW>
> ZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
> > sep="\t", skip=0, header=F,fill=T) #
> > > names(matt)<-c("id","reads")
> >
> > > dim(matt)
> > [1] 3384766 2
> >
> > > cover<-matt$reads
> >
> > > cover_per_2 <- function(data){
> > + l = length(data)
> > + output = numeric(l)
> > + for(i in 1:l)output[i] = sum(data >= data[i])
> > + 100 * output / l
> > + }
> >
> > > result3<-cover_per_2(cover)
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > On Thu, Nov 4, 2010 at 10:37 AM, Changbin Du
> > <changbind at gmail.com> wrote:
> >
> > > Thanks Phil, that is great! I WILL try this and let you
> > know how it goes.
> > >
> > >
> > >
> > >
> > > On Thu, Nov 4, 2010 at 10:16 AM, Phil Spector
> > <spector at stat.berkeley.edu>wrote:
> > >
> > >> Changbin -
> > >> Does
> > >>
> > >> 100 * sapply(matt$reads,function(x)sum(matt$reads >=
> > >> x))/length(matt$reads)
> > >>
> > >> give what you want?
> > >>
> > >> By the way, if you want to use a loop (there's nothing
> > wrong with
> > >> that),
> > >> then try to avoid the most common mistake that people make
> > with loops in
> > >> R:
> > >> having your result grow inside the loop. Here's a better
> > way to use a
> > >> loop
> > >> to solve your problem:
> > >>
> > >> cover_per_1 <- function(data){
> > >> l = length(data)
> > >> output = numeric(l)
> > >> for(i in 1:l)output[i] = 100 * sum(ifelse(data >= data[i], 1,
> > >> 0))/length(data)
> > >> output
> > >> }
> > >>
> > >> Using some random data, and comparing to your original
> > cover_per function:
> > >>
> > >> dat = rnorm(1000)
> > >>> system.time(one <- cover_per(dat))
> > >>>
> > >> user system elapsed
> > >> 0.816 0.000 0.824
> > >>
> > >>> system.time(two <- cover_per_1(dat))
> > >>>
> > >> user system elapsed
> > >> 0.792 0.000 0.805
> > >>
> > >> Not that big a speedup, but it does increase quite a bit
> > as the problem
> > >> gets
> > >> larger.
> > >>
> > >> There are two obvious ways to speed up your function:
> > >> 1) Eliminate the ifelse function, since automatic
> coersion from
> > >> logical to numeric does the same thing.
> > >> 2) Multiply by 100 and divide by the length outside the loop:
> > >>
> > >> cover_per_2 <- function(data){
> > >> l = length(data)
> > >> output = numeric(l)
> > >> for(i in 1:l)output[i] = sum(data >= data[i])
> > >> 100 * output / l
> > >> }
> > >>
> > >> system.time(three <- cover_per_2(dat))
> > >>>
> > >> user system elapsed
> > >> 0.024 0.000 0.027
> > >>
> > >> That makes the loop just about equivalent to the sapply solution:
> > >>
> > >> system.time(four <- 100*sapply(dat,function(x)sum(dat >=
> > x))/length(dat))
> > >>>
> > >> user system elapsed
> > >> 0.024 0.000 0.026
> > >>
> > >> - Phil Spector
> > >> Statistical
> > Computing Facility
> > >> Department of Statistics
> > >> UC Berkeley
> > >> spector at stat.berkeley.edu
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> > >> On Thu, 4 Nov 2010, Changbin Du wrote:
> > >>
> > >> HI, Dear R community,
> > >>>
> > >>> I have one data set like this, What I want to do is to
> > calculate the
> > >>> cumulative coverage. The following codes works for small
> > data set (#rows
> > >>> =
> > >>> 100), but when feed the whole data set, it still running
> > after 24 hours.
> > >>> Can someone give some suggestions for long vector?
> > >>>
> > >>> id reads
> > >>> Contig79:1 4
> > >>> Contig79:2 8
> > >>> Contig79:3 13
> > >>> Contig79:4 14
> > >>> Contig79:5 17
> > >>> Contig79:6 20
> > >>> Contig79:7 25
> > >>> Contig79:8 27
> > >>> Contig79:9 32
> > >>> Contig79:10 33
> > >>> Contig79:11 34
> > >>>
> > >>>
> > >>>
> > matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILL
> UMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GW>
> ZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
> > >>> sep="\t", skip=0, header=F,fill=T) #
> > >>> dim(matt)
> > >>> [1] 3384766 2
> > >>>
> > >>> matt_plot<-function(matt, outputfile) {
> > >>> names(matt)<-c("id","reads")
> > >>>
> > >>> cover<-matt$reads
> > >>>
> > >>>
> > >>> #calculate the cumulative coverage.
> > >>> + cover_per<-function (data) {
> > >>> + output<-numeric(0)
> > >>> + for (i in data) {
> > >>> + x<-(100*sum(ifelse(data >= i, 1, 0))/length(data))
> > >>> + output<-c(output, x)
> > >>> + }
> > >>> + return(output)
> > >>> + }
> > >>>
> > >>>
> > >>> result<-cover_per(cover)
> > >>>
> > >>>
> > >>> Thanks so much!
> > >>>
> > >>>
> > >>> --
> > >>> Sincerely,
> > >>> Changbin
> > >>> --
> > >>>
> > >>> [[alternative HTML version deleted]]
> > >>>
> > >>> ______________________________________________
> > >>> R-help at r-project.org mailing list
> > >>> https://stat.ethz.ch/mailman/listinfo/r-help
> > >>> PLEASE do read the posting guide
> > >>> http://www.R-project.org/posting-guide.html
> > >>> and provide commented, minimal, self-contained,
> reproducible code.
> > >>>
> > >>>
> > >
> > >
> > > --
> > > Sincerely,
> > > Changbin
> > > --
> > >
> > > Changbin Du
> > > DOE Joint Genome Institute
> > > Bldg 400 Rm 457
> > > 2800 Mitchell Dr
> > > Walnut Creet, CA 94598
> > > Phone: 925-927-2856
> > >
> > >
> > >
> >
> >
> > --
> > Sincerely,
> > Changbin
> > --
> >
> > Changbin Du
> > DOE Joint Genome Institute
> > Bldg 400 Rm 457
> > 2800 Mitchell Dr
> > Walnut Creet, CA 94598
> > Phone: 925-927-2856
> >
> > [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
More information about the R-help
mailing list