[BioC] IRanges:::coverage() speedup/enchancement

Charles C. Berry cberry at tajo.ucsd.edu
Tue Dec 1 20:30:40 CET 2009


On Tue, 1 Dec 2009, Patrick Aboyoun wrote:

> Chuck,
> Thanks for the speedup to coverage for large width IRanges objects. I checked 
> in changes to IRanges 1.5.13 in BioC 2.6 (for use with R-devel) based on the 
> code you submitted. I'll back port this code to BioC 2.5 (R 2.10) in the next 
> few days as well.

Thank you so much!

> Just out of curiosity, what is the source of these long 
> width intervals, where do the weights for these intervals come from, and what 
> operations do you perform on the resulting coverage vectors?

Re the source of the intervals:

I collaborate with Rick Bushman's group at Upenn 
(http://microb230.med.upenn.edu/), which studies 'the transfer of genetic 
information between cells and organisms'.

It is of particular interest to know where a mobile DNA element is 
preferentially sited in a genome. (And not just for scientific reasons, 
gene therapy constructs can do serious damage if they land in the wrong 
place, so designing good gene therapy vectors is aided by being able to 
predict where a vector tends to land.)

In our papers (for example Berry et al. Selection of Target Sites for 
Mobile DNA Integration in the Human Genome. PLoS Comput Biol 2(11): e157. 
doi:10.1371/journal.pcbi.0020157), it emerged that features like 'gene 
density' (number of genes per base in a window of a specified width), CpG 
island site density, and DNASE I site density have substantial effects on 
integration preference. The length scale for 'width' affects the 
conclusions, and widths greater than a megabase and perhaps to tens of 
megabases seem to matter.

---------------

Re the weights:

In performing the calculations of 'expression density' (counts per base 
of genes expressed above some threshhold value) using Affy arrays, I had 
to deal with varying numbers of probesets for different genes. It seemed 
to me that counting all probesets with equal weight would introduce 
unwanted variation, so I used an inverse weighting scheme so that the sum 
of weights for each gene would equal one.

Michael's workaround seems reasonable.

----------------

Re the operations performed on coverage:

You can browse the paper mentioned above to get the idea, but the 
operations are just a lookup of the coverage values (with some adjustments 
made for gaps and chromosome ends) followed by data analysis using those 
values.

In the idiom of the IRanges package (and simplified a bit for easier 
reading), something like this:

## jurkat is a data.frame describing HIV integration site locations 
## (aka 'insertions') and those of in silico controls ('match')

> xtabs( ~type, jurkat )
type
insertion     match
     39237    116161


all.chr <- levels(jurkat$Chr)

### locations as a RangedData instance:

jurkat.RL <- RangedData(IRanges(start=jurkat$Position,width=1),
 	space=jurkat$Chr, strand=jurkat$Ort,orig.rows=rownames(jurkat))

### get a 1MBase gene density mapping for geneHuman():

library(GenomicFeatures.Hsapiens.UCSC.hg18)
knownGenes <- geneHuman()
known.RL <- with( subset(knownGenes, chrom%in%all.chr ),
 		  RangedData( IRanges(start=txStart,en=txEnd),
                               space=factor(chrom, all.chr ),
                               strand=factor(strand,levels(jurkat$Ort))))
wider.RL <- known.RL
start( wider.RL ) <- start( wider.RL ) - 5e5
end( wider.RL ) <-  end( wider.RL ) + 5e5
wider.cover <- coverage( wider.RL, width = seqlengths(Hsapiens)[ names( wider.RL ) ] )

### now lookup the coverage (aka density) values for the insertions and 
### matches:

jurkat.wider.cover <- wider.cover[ ranges(jurkat.RL) ]

## now match the coverage values back to the parent data.frame and do some 
## statistical analysis:

## simple examples:

> table( as.vector( as.vector( jurkat.wider.cover ) ),
+	jurkat[ jurkat.RL$orig.rows, 'type' ] )


       insertion match
   0         993 14881
   1         568  7490
   2         843  6453
   3        1039  7292
   4        1044  6104
   5        1044  5590
   6        1045  5086
[snip]

>  lapply(split(as.vector(as.vector(jurkat.wider.cover)),
+	  jurkat[jurkat.RL$orig.rows,'type']),mean)

$insertion
[1] 52.6881

$match
[1] 22.71713

>

Chuck

> Echoing Michael's comments, we haven't supported double precision weights in 
> coverage calculations in the past because we hadn't encountered any common 
> use cases for them and there is the workaround Michael mentioned if the need 
> arose. Providing some context for the enhancement request would help motivate 
> us to make a change. :)
>
>
> Cheers,
> Patrick
>
>
>
> Michael Lawrence wrote:
>>  On Mon, Nov 30, 2009 at 11:10 AM, Charles C. Berry
>>  <cberry at tajo.ucsd.edu>wrote:
>>
>> 
>> >  The semantics of the IRanges package and especially the RangedData class
>> >  are very apprpriate for some of the applications I deal with.
>> > 
>> >  Unfortunately, coverage() is too slow to be useful to me.
>> > 
>> >  I wonder if the Biocore Team would consider retooling it to make it
>> >  faster? Below I provide a link to a revised coverage.c that might 
>> >  suffice.
>> > 
>> >  The kind of case I need to handle has width values in 10kbase to 10Mbase
>> >  range. As a toy example, being able to run stuff like
>> > 
>> >       tmp <- coverage( IRanges( start=seq(1,by=1000,length=10000),
>> >                         width=1e7 ) )
>> > 
>> >  quickly is needed.
>> > 
>> >  A revised version of coverage.c is available at
>> > 
>> >  http://cabig2.ucsd.edu:8080/Plone/Members/ccberry/software/coverage.c/view
>> > 
>> >  It will handle the case above almost instantly (while the existing 
>> >  version
>> >  needs about 8 minutes on my machine) and seems about equal to the
>> >  existing version for cases with width=30.  In the cases I've looked at
>> >  gc() reports the same memory usage.
>> > 
>> >  ---
>> > 
>> >  Also, I wonder if the Biocore Team would entertain allowing the 'weight'
>> >  argument of coverage to be of type double? This would help in cases in
>> >  which downweighting of counts of some genomic features is desired.
>> > 
>> > 
>> >
>>  In many use cases, it's probably sufficient to simply round floating point
>>  numbers to integers after multiplying by a power of 10. That only goes so
>>  far though, so supporting double-precision seems reasonable. The type of
>>  the
>>  output will simply depend on the type of the weights.
>> 
>> 
>>
>> 
>> >  Thanks,
>> > 
>> >  Chuck
>> > 
>> >  --
>> >  Charles C. Berry                            (858) 534-2098
>> >                                             Dept of Family/Preventive
>> >  Medicine
>> >  E mailto:cberry at tajo.ucsd.edu               UC San Diego
>> >  http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 
>> >  92093-0901
>> > 
>> >  _______________________________________________
>> >  Bioconductor mailing list
>> >  Bioconductor at stat.math.ethz.ch
>> >  https://stat.ethz.ch/mailman/listinfo/bioconductor
>> >  Search the archives:
>> >  http://news.gmane.org/gmane.science.biology.informatics.conductor
>> > 
>> > 
>>
>>   [[alternative HTML version deleted]]
>>
>>  _______________________________________________
>>  Bioconductor mailing list
>>  Bioconductor at stat.math.ethz.ch
>>  https://stat.ethz.ch/mailman/listinfo/bioconductor
>>  Search the archives:
>>  http://news.gmane.org/gmane.science.biology.informatics.conductor
>> 
>
>

Charles C. Berry                            (858) 534-2098
                                             Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu	            UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901



More information about the Bioconductor mailing list