[R] Is that an efficient way to find the overlapped , upstream and downstream rangess for a bunch of rangess

Yao He yao.h.1988 at gmail.com
Tue Apr 5 19:29:36 CEST 2016


I do have a bunch of genes ( nearly ~50000)  from the whole genome, which
read in genomic ranges

A range(gene) can be seem as an observation has three columns chromosome,
start and end, like that

       seqnames start end width strand

gene1     chr1     1   5     5      +

gene2     chr1    10  15     6      +

gene3     chr1    12  17     6      +

gene4     chr1    20  25     6      +

gene5     chr1    30  40    11      +

I just wondering is there an efficient way to find *overlapped, upstream
and downstream genes for each gene in the granges*

For example, assuming all_genes_gr is a ~50000 genes genomic range, the
result I want like belows:
gene_name upstream_gene downstream_gene overlapped_gene
gene1 NA gene2 NA
gene2 gene1 gene4 gene3
gene3 gene1 gene4 gene2
gene4 gene3 gene5 NA

Currently ,  the strategy I use is like that,

library(GenomicRanges)

find_overlapped_gene <- function(idx, all_genes_gr) {
  #cat(idx, "\n")
  curr_gene <- all_genes_gr[idx]
  other_genes <- all_genes_gr[-idx]
  n <- countOverlaps(curr_gene, other_genes)
  gene <- subsetByOverlaps(curr_gene, other_genes)
  return(list(n, gene))
}​

system.time(lapply(1:100, function(idx)  find_overlapped_gene(idx,
all_genes_gr)))

However, for 100 genes, it use nearly ~8s by system.time().That means if I
had 50000 genes, nearly one hour for just find overlapped gene.

I am just wondering any algorithm or strategy to do that efficiently,
perhaps 50000 genes in ~10min or even less

Yao He

	[[alternative HTML version deleted]]



More information about the R-help mailing list