[BioC] get promoter regions (bed format) of genes hg19

Thu Jan 23 00:06:09 CET 2014

Thanks, Val, for the details.  Now I have what is either a shameless plug or
a request for explanation.  matchGenes() in _bumphunter_ is similar, although
definitely not the same.  I am not sure how to translate between the results,
except to note that matchGenes uses nearest() with either select="arbitrary",
by default, or select="all" -- rather than findOverlaps().  It is much faster
than what you did, in effect because your transcriptsBy call is "compiled in".
At any rate, the user just does

	library(bumphunter)

	# this uses nearest(select="arbitrary")
	matchGenes(htr_genes_gr, promoterDist=2000)
or
	# this uses nearest(select="all")
	matchGenes(htr_genes_gr, promoterDist=2000, all=TRUE)

The results are annotated based on the matching transcript's TSS, but neither
answer has length 6 in this example.  Maybe someone can elucidate the other
differences (and perhaps some similarities).

On Jan 22, 2014, at 11:56 AM, Valerie Obenchain wrote:

> Hi Ninni,
> 
> A more complete description of the promoters() function is on the man page in the GenomicRanges package. You can find this by specifying the full method name:
> 
>  ?'promoters,GenomicRanges-method'
> 
> The GRanges or GRangesList you supply as 'x' is assumed to contain gene ranges. Here is a snip from the man page stating that the assumption is that start(x) is the TSS:
> 
>>        • ‘promoters’ returns an object of the same type and length as
>>          ‘x’ containing promoter ranges. Promoter ranges extend around
>>          the transcription start site (TSS) which is defined as
>>          ‘start(x)’. The ‘upsteam’ and ‘downstream’ arguments define
>>          the number of nucleotides in the 5' and 3' direction,
>>          respectively. The full range is defined as,
>> 
>>          (start(x) - upstream) to (start(x) + downstream - 1).
> 
> So promoters() simply returns the region upstream and/or downstream of the range you supply as 'x'. If you have a generic range and want to determine if it falls within a gene you can use findOverlaps() on a TranscriptDb and your ranges.
> 
>  library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>  txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
>  genes <- transcriptsBy(txdb, "gene")
>  hits <- findOverlaps(htr_genes_gr, genes)
> 
> 'queryHits' is an index of 'htr_genes_gr' ranges that maps to the 'subjectHits' index of 'genes' ranges. Here we see that both range 2 and 4 in 'htr_genes_gr' each map to 2 different genes.
> 
>>> hits
>> Hits of length 6
>> queryLength: 4
>> subjectLength: 23459
>>  queryHits subjectHits
>>   <integer>   <integer>
>> 1         1       19185
>> 2         2       17448
>> 3         2       20105
>> 4         3       17448
>> 5         4        5026
>> 6         4        8270
> 
> Isolate the genes that the ranges in 'htr_genes_gr' hit:
> 
>  mygenes <- genes[subjectHits(hits)]
> 
> Call promoters() on the gene ranges of interest.
> 
>  promoters(mygenes, upstream=2000)
> 
> 
> Valerie
> 
> On 01/22/2014 01:12 AM, Ninni Nahm [guest] wrote:
>> 
>> Dear all,
>> 
>> I'm trying to obtain promoter regions of a list of genes.
>> This is my granges example:
>> 
>> htr_genes_gr = GRanges( seqnames = rep("chr1", 4),
>> ranges = IRanges(start = c(131025, 761586, 762988, 860260), end = c(134836, 794826, 794826, 879955)))
>> 
>> I wanted to use promoters(htr_genes_gr, upstream=2000)
>> to get the upstream regions. However, I get something, but Im afraid that it does not match with the USCS genome browser. So, I take the output and view it in IGV with hg19 loaded. But the promoter seems shifted, i.e. it does not start where the gene begins. When I look at the IRanges web page, there is an indication that the genome version might be hg18. I am not sure if this is the problem. Anyways, does anyone have a suggestion how I can get the bed file of the promoter regions. I need the promoter regions of all genes in the end.
>> Thanks a lot!
>> 
>> 
>> 
>>  -- output of sessionInfo():
>> 
>>  sessionInfo()
>> R version 3.0.2 (2013-09-25)
>> Platform: x86_64-pc-linux-gnu (64-bit)
>> 
>> locale:
>>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>> 
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>> 
>> --
>> Sent via the guest posting facility at bioconductor.org.
>> 
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>> 
> 
> 
> -- 
> Valerie Obenchain
> 
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B155
> P.O. Box 19024
> Seattle, WA 98109-1024
> 
> E-mail: vobencha at fhcrc.org
> Phone:  (206) 667-3158
> Fax:    (206) 667-1319
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor