[R] how to work with long vectors
Martin Morgan
mtmorgan at fhcrc.org
Fri Nov 5 17:42:47 CET 2010
On 11/05/2010 09:13 AM, Changbin Du wrote:
> HI, Phil,
>
> I used the following codes and run it overnight for 15 hours, this morning,
> I stopped it. It seems it is still not efficient.
>
>
>>
> matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
> sep="\t", skip=0, header=F,fill=T) #
>> names(matt)<-c("id","reads")
>
>> dim(matt)
> [1] 3384766 2
[snip]
>>> On Thu, 4 Nov 2010, Changbin Du wrote:
>>>
>>> HI, Dear R community,
>>>>
>>>> I have one data set like this, What I want to do is to calculate the
>>>> cumulative coverage. The following codes works for small data set (#rows
>>>> =
>>>> 100), but when feed the whole data set, it still running after 24 hours.
>>>> Can someone give some suggestions for long vector?
>>>>
>>>> id reads
>>>> Contig79:1 4
>>>> Contig79:2 8
>>>> Contig79:3 13
>>>> Contig79:4 14
>>>> Contig79:5 17
>>>> Contig79:6 20
>>>> Contig79:7 25
>>>> Contig79:8 27
>>>> Contig79:9 32
>>>> Contig79:10 33
>>>> Contig79:11 34
>>>>
>>>>
>>>> matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
>>>> sep="\t", skip=0, header=F,fill=T) #
>>>> dim(matt)
>>>> [1] 3384766 2
>>>>
>>>> matt_plot<-function(matt, outputfile) {
>>>> names(matt)<-c("id","reads")
>>>>
>>>> cover<-matt$reads
>>>>
>>>>
>>>> #calculate the cumulative coverage.
>>>> + cover_per<-function (data) {
>>>> + output<-numeric(0)
>>>> + for (i in data) {
>>>> + x<-(100*sum(ifelse(data >= i, 1, 0))/length(data))
>>>> + output<-c(output, x)
>>>> + }
>>>> + return(output)
>>>> + }
>>>>
>>>>
>>>> result<-cover_per(cover)
Hi Changbin
If I understand correctly, your contigs 'start' at position 1, and have
'width' equal to matt$reads. You'd like to know the coverage at the last
covered location of each contig in matt$reads.
## first time only
source("http://bioconductor.org")
biocLite("IRanges")
##
library(IRanges)
contigs = IRanges(start=1, width=matt$reads)
cvg = coverage(contigs) ## an RLE summarizing coverage, from position 1
as.vector(cvg[matt$reads]) / nrow(matt) ## at the end of each contig
for a larger data set:
> matt=data.frame(reads=ceiling(as.integer(runif(3384766, 1, 100))))
> contigs = IRanges(start=1, width=matt$reads)
> system.time(cvg <- coverage(contigs))
user system elapsed
5.145 0.050 5.202
Martin
>>>>
>>>>
>>>> Thanks so much!
>>>>
>>>>
>>>> --
>>>> Sincerely,
>>>> Changbin
>>>> --
>>>>
>>>> [[alternative HTML version deleted]]
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>>
>>
>>
>> --
>> Sincerely,
>> Changbin
>> --
>>
>> Changbin Du
>> DOE Joint Genome Institute
>> Bldg 400 Rm 457
>> 2800 Mitchell Dr
>> Walnut Creet, CA 94598
>> Phone: 925-927-2856
>>
>>
>>
>
>
--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
Location: M1-B861
Telephone: 206 667-2793
More information about the R-help
mailing list