[R] how to work with long vectors
Martin Morgan
mtmorgan at fhcrc.org
Thu Nov 4 18:06:56 CET 2010
On 11/04/2010 09:45 AM, Changbin Du wrote:
> Thanks, Jim!
>
> This is not what I want. What I want is to calculate the percentage of
> reads bigger than or equal to the reads at each position. My output is
> like the following:
Hi Changbin -- I might be repeating myself, but the Bioconductor
packages IRanges and GenomicRanges are designed to work with this sort
of data, and include 'coverage' functions that do what you're interested
in. Look into ?GRanges if interested.
http://bioconductor.org/help/bioc-views/release/BiocViews.html#___HighThroughputSequencing
Martin
> for row 1, all the reads are >= 4, so the cover_per is 100;
> for row 2, 99% of the reads are >= 4, so the cover_per is 99.
>> head(final)
> cover_per reads
> 1 100 4
> 2 99 8
> 3 98 13
> 4 97 14
> 5 96 17
> 6 95 20
>
> I attached the input file with this email. The file is only 100 rows, very
> small. My original data set is 3384766 rows.
>
> >
> matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
> sep="\t", skip=0, header=F,fill=T) #
>> dim(matt)
> [1] 3384766 2
>
> Thanks so much for your time!
>
>> matt<-read.table("/home/cdu/operon/dimer5_0623/matt_test.txt", sep="\t",
> skip=0, header=F,fill=T) #
>> names(matt)<-c("id","reads")
>> dim(matt)
> [1] 100 2
>> cover<-matt$reads
>>
>>
>> #calculate the cumulative coverage.
>> cover_per<-function (data) {
> + output<-numeric(0)
> + for (i in data) {
> + x<-(100*sum(ifelse(data >= i, 1, 0))/length(data))
> + output<-c(output, x)
> + }
> + return(output)
> + }
>>
>>
>> result<-cover_per(cover)
>> head(result)
> [1] 100 99 98 97 96 95
>>
>> final<-data.frame(result, cover)
>>
>> names(final)<-c("cover_per", "reads")
>> head(final)
> cover_per reads
> 1 100 4
> 2 99 8
> 3 98 13
> 4 97 14
> 5 96 17
> 6 95 20
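The loop above is quadratic: for each of the n values it scans all n values again, and `output<-c(output, x)` reallocates the result on every iteration, which is why 3384766 rows never finish. A vectorized sketch using base R's `rank()` (a suggestion not posted in the thread) computes the same percentages in a single pass:

```r
# Percentage of reads >= each value: the count of elements >= data[j]
# equals n minus the number of strictly smaller elements, and
# rank(data, ties.method = "min") - 1 is exactly that smaller-count.
cover_per_fast <- function(data) {
  n <- length(data)
  100 * (n - rank(data, ties.method = "min") + 1) / n
}

cover <- c(4, 8, 13, 14, 17, 20, 25, 27, 32, 33, 34)  # sample from the thread
round(cover_per_fast(cover), 2)
```

On a sorted, duplicate-free vector of 100 values this reproduces the 100, 99, 98, ... sequence shown above; `rank()` runs in O(n log n), so millions of rows take seconds rather than days.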
>
>
>
>
>
> On Thu, Nov 4, 2010 at 9:18 AM, jim holtman <jholtman at gmail.com> wrote:
>
>> Is this what you want:
>>
>>> x
>> id reads
>> 1 Contig79:1 4
>> 2 Contig79:2 8
>> 3 Contig79:3 13
>> 4 Contig79:4 14
>> 5 Contig79:5 17
>> 6 Contig79:6 20
>> 7 Contig79:7 25
>> 8 Contig79:8 27
>> 9 Contig79:9 32
>> 10 Contig79:10 33
>> 11 Contig79:11 34
>>> x$percent <- x$reads / max(x$reads) * 100
>>> x
>> id reads percent
>> 1 Contig79:1 4 11.76471
>> 2 Contig79:2 8 23.52941
>> 3 Contig79:3 13 38.23529
>> 4 Contig79:4 14 41.17647
>> 5 Contig79:5 17 50.00000
>> 6 Contig79:6 20 58.82353
>> 7 Contig79:7 25 73.52941
>> 8 Contig79:8 27 79.41176
>> 9 Contig79:9 32 94.11765
>> 10 Contig79:10 33 97.05882
>> 11 Contig79:11 34 100.00000
>>
>>
>> On Thu, Nov 4, 2010 at 11:46 AM, Changbin Du <changbind at gmail.com> wrote:
>>> HI, Dear R community,
>>>
>>> I have one data set like this. What I want to do is to calculate the
>>> cumulative coverage. The following code works for a small data set
>>> (#rows = 100), but when fed the whole data set it is still running
>>> after 24 hours. Can someone give some suggestions for long vectors?
>>>
>>> id reads
>>> Contig79:1 4
>>> Contig79:2 8
>>> Contig79:3 13
>>> Contig79:4 14
>>> Contig79:5 17
>>> Contig79:6 20
>>> Contig79:7 25
>>> Contig79:8 27
>>> Contig79:9 32
>>> Contig79:10 33
>>> Contig79:11 34
>>>
>>>
>> matt<-read.table("/house/groupdirs/genetic_analysis/mjblow/ILLUMINA_ONLY_MICROBIAL_GENOME_ASSEMBLY/4083340/STANDARD_LIBRARY/GWZW.994.5.1129.trim_69.fastq.19621832.sub.sorted.bam.clone.depth",
>>> sep="\t", skip=0, header=F,fill=T) #
>>> dim(matt)
>>> [1] 3384766 2
>>>
>>> matt_plot<-function(matt, outputfile) {
>>> names(matt)<-c("id","reads")
>>>
>>> cover<-matt$reads
>>>
>>>
>>> #calculate the cumulative coverage.
>>> + cover_per<-function (data) {
>>> + output<-numeric(0)
>>> + for (i in data) {
>>> + x<-(100*sum(ifelse(data >= i, 1, 0))/length(data))
>>> + output<-c(output, x)
>>> + }
>>> + return(output)
>>> + }
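If the loop structure is to be kept, two changes help at this scale (again a suggestion, not from the thread): preallocate the output instead of growing it with `c()`, and replace `sum(ifelse(data >= i, 1, 0))` with `sum(data >= i)`, which avoids building a throwaway vector. The comparisons are still O(n^2), so for millions of rows a fully vectorized approach (e.g. `rank()`) is still needed:

```r
# Same computation as cover_per(), but with a preallocated result vector;
# growing output with c() copies the whole vector on every iteration.
cover_per_loop <- function(data) {
  n <- length(data)
  output <- numeric(n)        # allocate once, fill in place
  for (j in seq_len(n)) {
    output[j] <- 100 * sum(data >= data[j]) / n
  }
  output
}
```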
>>>
>>>
>>> result<-cover_per(cover)
>>>
>>>
>>> Thanks so much!
>>>
>>>
>>> --
>>> Sincerely,
>>> Changbin
>>> --
>>>
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>>
>>
>> --
>> Jim Holtman
>> Cincinnati, OH
>> +1 513 646 9390
>>
>> What is the problem that you are trying to solve?
>>
>
>
>
>
>
--
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
Location: M1-B861
Telephone: 206 667-2793