[R] how to work with long vectors
William Dunlap
wdunlap at tibco.com
Fri Nov 5 17:58:04 CET 2010
The following cover_per_3 uses sorting to solve
the problem more quickly. It still has room
for improvement (one possibility is sketched after the function).
cover_per_3 <- function (data)
{
    n <- length(data)
    o <- rev(order(data))      # indices that put 'data' in decreasing order
    sdata <- data[o]           # data sorted from largest to smallest
    r <- rle(sdata)$lengths    # run lengths of tied values
    output <- numeric(n)
    # cumsum(r)[k] counts the elements >= the k-th largest distinct value;
    # rep() spreads that count across ties, and [o] maps it back to the
    # original positions
    output[o] <- rep(cumsum(r), r)
    100 * output/n
}
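(One further tightening, as a sketch: the count of elements >= data[i]
is n + 1 - rank(data, ties.method = "min"), so the whole computation
collapses to one call. cover_per_4 is an illustrative name, not from
this thread.)

cover_per_4 <- function (data)
{
    n <- length(data)
    # n + 1 - (min-rank of data[i]) = number of elements >= data[i]
    100 * (n + 1 - rank(data, ties.method = "min")) / n
}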
(The ecdf function would probably also do the
job quickly.)
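A minimal sketch of that idea (the name cover_per_ecdf is mine):
ecdf(-data) evaluated at -x gives mean(-data <= -x), i.e.
mean(data >= x), which is exactly the per-element proportion the
original cover_per computes.

cover_per_ecdf <- function (data)
{
    # proportion of elements >= each data value, via the empirical CDF
    100 * ecdf(-data)(-data)
}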
When trying to work on problems like this I find
it most fruitful to work on smaller datasets and
see how the time grows with the size of the data,
instead of seeing how many days it takes on a huge
dataset. E.g., the following compares times for
your original function, Phil Spector's simple cleanup
of your function, and the sort-based approach for
vectors of length 10 and 20 thousand.
> z <- sample(5e3, size=1e4, replace=TRUE)
> print(system.time(v <- cover_per(z)))
   user  system elapsed
  38.21    0.00   38.41
> print(system.time(v_2 <- cover_per_2(z)))
   user  system elapsed
   0.86    0.00    0.86
> print(system.time(v_3 <- cover_per_3(z)))
   user  system elapsed
      0       0       0
> identical(v_3, v)
[1] TRUE
> z <- sample(1e4, size=2e4, replace=TRUE)
> print(system.time(v <- cover_per(z)))
   user  system elapsed
 158.48    0.07  159.31
> print(system.time(v_2 <- cover_per_2(z)))
   user  system elapsed
   3.23    0.00    3.25
> print(system.time(v_3 <- cover_per_3(z)))
   user  system elapsed
   0.02    0.00    0.02
> identical(v_3, v)
[1] TRUE
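A compact way to script that kind of scaling check (a sketch; the
sizes are arbitrary):

for (n in c(1e4, 2e4, 4e4)) {
    z <- sample(n/2, size = n, replace = TRUE)
    cat("n =", n, "elapsed =",
        system.time(v_3 <- cover_per_3(z))["elapsed"], "\n")
}

If the elapsed time roughly quadruples each time n doubles, the
algorithm is quadratic; the sort-based version should grow only a
little faster than linearly.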
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
> -----Original Message-----
> From: r-help-bounces at r-project.org
> [mailto:r-help-bounces at r-project.org] On Behalf Of Changbin Du
> Sent: Friday, November 05, 2010 9:14 AM
> To: Phil Spector
> Cc: <r-help at r-project.org>
> Subject: Re: [R] how to work with long vectors
>
> Hi, Phil,
>
> I used the following code and ran it overnight for 15 hours;
> this morning I stopped it. It seems it is still not efficient.
>
>
> >
> > matt <- read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
> +                    sep="\t", skip=0, header=F, fill=T)
> > names(matt)<-c("id","reads")
>
> > dim(matt)
> [1] 3384766 2
>
> > cover<-matt$reads
>
> > cover_per_2 <- function(data){
> + l = length(data)
> + output = numeric(l)
> + for(i in 1:l)output[i] = sum(data >= data[i])
> + 100 * output / l
> + }
>
> > result3<-cover_per_2(cover)
>
> On Thu, Nov 4, 2010 at 10:37 AM, Changbin Du
> <changbind at gmail.com> wrote:
>
> > Thanks Phil, that is great! I will try this and let you know how
> > it goes.
> >
> > On Thu, Nov 4, 2010 at 10:16 AM, Phil Spector
> > <spector at stat.berkeley.edu> wrote:
> >
> >> Changbin -
> >> Does
> >>
> >> 100 * sapply(matt$reads, function(x) sum(matt$reads >= x)) /
> >>     length(matt$reads)
> >>
> >> give what you want?
> >>
> >> By the way, if you want to use a loop (there's nothing wrong with
> >> that), then try to avoid the most common mistake that people make
> >> with loops in R: having your result grow inside the loop. Here's a
> >> better way to use a loop to solve your problem:
> >>
> >> cover_per_1 <- function(data){
> >>   l = length(data)
> >>   output = numeric(l)    # preallocate the result once
> >>   for(i in 1:l)
> >>     output[i] = 100 * sum(ifelse(data >= data[i], 1, 0))/length(data)
> >>   output
> >> }
> >>
> >> Using some random data, and comparing to your original cover_per
> >> function:
> >>
> >> dat = rnorm(1000)
> >> > system.time(one <- cover_per(dat))
> >>    user  system elapsed
> >>   0.816   0.000   0.824
> >>
> >> > system.time(two <- cover_per_1(dat))
> >>    user  system elapsed
> >>   0.792   0.000   0.805
> >>
> >> Not that big a speedup, but it does increase quite a bit as the
> >> problem gets larger.
> >>
> >> There are two obvious ways to speed up your function:
> >> 1) Eliminate the ifelse function, since automatic coercion from
> >> logical to numeric does the same thing (a short illustration
> >> follows the timing below).
> >> 2) Multiply by 100 and divide by the length outside the loop:
> >>
> >> cover_per_2 <- function(data){
> >>   l = length(data)
> >>   output = numeric(l)
> >>   for(i in 1:l)
> >>     output[i] = sum(data >= data[i])   # logicals coerce to 0/1
> >>   100 * output / l                     # scale once, outside the loop
> >> }
> >>
> >> > system.time(three <- cover_per_2(dat))
> >>    user  system elapsed
> >>   0.024   0.000   0.027
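> >> (The illustration promised for point 1, as a minimal sketch:
> >>
> >>   sum(ifelse(c(2, 5, 3) >= 3, 1, 0))   # 2
> >>   sum(c(2, 5, 3) >= 3)                 # also 2; TRUE counts as 1
> >>
> >> so dropping ifelse changes nothing but the amount of work done.)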
> >>
> >> That makes the loop just about equivalent to the sapply solution:
> >>
> >> > system.time(four <- 100*sapply(dat, function(x) sum(dat >= x))/length(dat))
> >>    user  system elapsed
> >>   0.024   0.000   0.026
> >>
> >> - Phil Spector
> >>   Statistical Computing Facility
> >>   Department of Statistics
> >>   UC Berkeley
> >>   spector at stat.berkeley.edu
> >>
> >> On Thu, 4 Nov 2010, Changbin Du wrote:
> >>
> >>> Hi, Dear R community,
> >>>
> >>> I have one data set like this. What I want to do is to calculate
> >>> the cumulative coverage. The following code works for a small data
> >>> set (#rows = 100), but when fed the whole data set, it was still
> >>> running after 24 hours. Can someone give some suggestions for long
> >>> vectors?
> >>>
> >>> id reads
> >>> Contig79:1 4
> >>> Contig79:2 8
> >>> Contig79:3 13
> >>> Contig79:4 14
> >>> Contig79:5 17
> >>> Contig79:6 20
> >>> Contig79:7 25
> >>> Contig79:8 27
> >>> Contig79:9 32
> >>> Contig79:10 33
> >>> Contig79:11 34
> >>>
> >>> matt <- read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
> >>>                    sep="\t", skip=0, header=F, fill=T)
> >>> dim(matt)
> >>> [1] 3384766 2
> >>>
> >>> matt_plot <- function(matt, outputfile) {
> >>>   names(matt) <- c("id","reads")
> >>>
> >>>   cover <- matt$reads
> >>>
> >>>   # calculate the cumulative coverage
> >>>   cover_per <- function (data) {
> >>>     output <- numeric(0)
> >>>     for (i in data) {
> >>>       x <- (100*sum(ifelse(data >= i, 1, 0))/length(data))
> >>>       output <- c(output, x)   # output grows on every iteration
> >>>     }
> >>>     return(output)
> >>>   }
> >>>
> >>>   result <- cover_per(cover)
> >>>
> >>>
> >>> Thanks so much!
> >>>
> >>>
> >>> --
> >>> Sincerely,
> >>> Changbin
> >>> --
> >>>
> >>> [[alternative HTML version deleted]]
> >>>
> >>> ______________________________________________
> >>> R-help at r-project.org mailing list
> >>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>> PLEASE do read the posting guide
> >>> http://www.R-project.org/posting-guide.html
> >>> and provide commented, minimal, self-contained, reproducible code.
> >>>
> >>>
> >
> >
> > --
> > Sincerely,
> > Changbin
> > --
> >
> > Changbin Du
> > DOE Joint Genome Institute
> > Bldg 400 Rm 457
> > 2800 Mitchell Dr
> > Walnut Creek, CA 94598
> > Phone: 925-927-2856
> >
> >
> >
>
>
> --
> Sincerely,
> Changbin
> --
>
> Changbin Du
> DOE Joint Genome Institute
> Bldg 400 Rm 457
> 2800 Mitchell Dr
> Walnut Creek, CA 94598
> Phone: 925-927-2856
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>