[BioC] find overlapping regions
Martin Morgan
mtmorgan at fhcrc.org
Tue May 20 15:35:02 CEST 2008
Hi Marten --
<M.Boetzer at lumc.nl> writes:
> Dear list,
>
> I have a single region with a start and an end, where start < end. I want to find the regions that overlap that region by more than 50%. The regions to compare against are in a data frame of start and end positions:
>
> start = 133375983
> end = 146245512
>
> data = data.frame(c(133470532, 133966699, 134162735, 134236863, 146225580), c(133754071, 133969713, 134163857, 134249655, 156245512))
> colnames(data) = c("start2", "end2")
>
>> data
>      start2      end2
> 1 133470532 133754071
> 2 133966699 133969713
> 3 134162735 134163857
> 4 134236863 134249655
> 5 146225580 156245512
>
> I've already written some code that does the trick; however, when reg1 becomes very large, it slows down considerably:
>
>
> regfound = c()
> reg1 = seq(start, end, 1)
> for (i in 1:nrow(data)) {
>     eq_reg = sum(is.element(seq(data$start2[i], data$end2[i], 1), reg1))
>     if (eq_reg != 0)
>         regfound = c(regfound, round(eq_reg / ((data$end2[i] - data$start2[i]) + 1) * 100, 1))
>     else
>         regfound = c(regfound, FALSE)
> }
>
>> regfound
> [1] 100.0 100.0 100.0 100.0 0.2
Probably the key is to simplify how the overlapping region is found,
and then to vectorize the calculation. Maybe something along the
lines of
> width <- data$end2 - data$start2 + 1
> olap <- (pmin(end, data$end2) - pmax(start, data$start2) + 1) / width
> olap > .5
[1] TRUE TRUE TRUE TRUE FALSE
?
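
As a slightly fuller sketch of the same vectorized idea, the helper below assumes inclusive coordinates (as in the original loop, which counts every position from start to end) and clamps the overlap to zero for regions that do not intersect the query at all. The function name overlap_fraction is hypothetical, chosen only for illustration; pmin and pmax are base R.

```r
# Hypothetical helper: fraction of each candidate region [start2, end2]
# that is covered by the query region [start, end].
# Assumption: coordinates are inclusive, so a region's width is end - start + 1.
overlap_fraction <- function(start, end, start2, end2) {
    width  <- end2 - start2 + 1
    shared <- pmin(end, end2) - pmax(start, start2) + 1
    pmax(shared, 0) / width   # disjoint regions give 0 rather than a negative value
}

start <- 133375983
end   <- 146245512
data  <- data.frame(
    start2 = c(133470532, 133966699, 134162735, 134236863, 146225580),
    end2   = c(133754071, 133969713, 134163857, 134249655, 156245512))

frac <- overlap_fraction(start, end, data$start2, data$end2)
round(frac * 100, 1)   # 100 100 100 100 0.2, the same percentages as regfound
frac > 0.5             # TRUE TRUE TRUE TRUE FALSE
```

Because nothing is expanded with seq(), the cost depends only on the number of rows in data, not on the length of the regions.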
Martin
>
> Does anyone know a faster or more elegant way of doing this?
>
> Thanks in advance,
> Marten
--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M2 B169
Phone: (206) 667-2793