[BioC] get promoter regions (bed format) of genes hg19
Harris A. Jaffee
hj at jhu.edu
Thu Jan 23 00:06:09 CET 2014
Thanks, Val, for the details. Now I have what is either a shameless plug or
a request for explanation. matchGenes() in _bumphunter_ is similar, although
definitely not the same. I am not sure how to translate between the results,
except to note that matchGenes uses nearest() with either select="arbitrary",
by default, or select="all" -- rather than findOverlaps(). It is much faster
than what you did, in effect because your transcriptsBy call is "compiled in".
At any rate, the user just does
library(bumphunter)
# this uses nearest(select="arbitrary")
matchGenes(htr_genes_gr, promoterDist=2000)
or
# this uses nearest(select="all")
matchGenes(htr_genes_gr, promoterDist=2000, all=TRUE)
The results are annotated based on the matching transcript's TSS, but neither
answer has length 6 in this example. Maybe someone can elucidate the other
differences (and perhaps some similarities).
On Jan 22, 2014, at 11:56 AM, Valerie Obenchain wrote:
> Hi Ninni,
>
> A more complete description of the promoters() function is on the man page in the GenomicRanges package. You can find this by specifying the full method name:
>
> ?'promoters,GenomicRanges-method'
>
> The GRanges or GRangesList you supply as 'x' is assumed to contain gene ranges. Here is a snip from the man page stating that the assumption is that start(x) is the TSS:
>
>> • ‘promoters’ returns an object of the same type and length as
>> ‘x’ containing promoter ranges. Promoter ranges extend around
>> the transcription start site (TSS) which is defined as
>> ‘start(x)’. The ‘upsteam’ and ‘downstream’ arguments define
>> the number of nucleotides in the 5' and 3' direction,
>> respectively. The full range is defined as,
>>
>> (start(x) - upstream) to (start(x) + downstream - 1).
>
> So promoters() simply returns the region upstream and/or downstream of the range you supply as 'x'. If you have a generic range and want to determine if it falls within a gene you can use findOverlaps() on a TranscriptDb and your ranges.
>
> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
> genes <- transcriptsBy(txdb, "gene")
> hits <- findOverlaps(htr_genes_gr, genes)
>
> 'queryHits' is an index of 'htr_genes_gr' ranges that maps to the 'subjectHits' index of 'genes' ranges. Here we see that both range 2 and 4 in 'htr_genes_gr' each map to 2 different genes.
>
>>> hits
>> Hits of length 6
>> queryLength: 4
>> subjectLength: 23459
>> queryHits subjectHits
>> <integer> <integer>
>> 1 1 19185
>> 2 2 17448
>> 3 2 20105
>> 4 3 17448
>> 5 4 5026
>> 6 4 8270
>
> Isolate the genes that the ranges in 'htr_genes_gr' hit:
>
> mygenes <- genes[subjectHits(hits)]
>
> Call promoters() on the gene ranges of interest.
>
> promoters(mygenes, upstream=2000)
>
>
> Valerie
>
> On 01/22/2014 01:12 AM, Ninni Nahm [guest] wrote:
>>
>> Dear all,
>>
>> I'm trying to obtain promoter regions of a list of genes.
>> This is my granges example:
>>
>> htr_genes_gr = GRanges( seqnames = rep("chr1", 4),
>> ranges = IRanges(start = c(131025, 761586, 762988, 860260), end = c(134836, 794826, 794826, 879955)))
>>
>> I wanted to use promoters(htr_genes_gr, upstream=2000)
>> to get the upstream regions. However, I get something, but Im afraid that it does not match with the USCS genome browser. So, I take the output and view it in IGV with hg19 loaded. But the promoter seems shifted, i.e. it does not start where the gene begins. When I look at the IRanges web page, there is an indication that the genome version might be hg18. I am not sure if this is the problem. Anyways, does anyone have a suggestion how I can get the bed file of the promoter regions. I need the promoter regions of all genes in the end.
>> Thanks a lot!
>>
>>
>>
>> -- output of sessionInfo():
>>
>> sessionInfo()
>> R version 3.0.2 (2013-09-25)
>> Platform: x86_64-pc-linux-gnu (64-bit)
>>
>> locale:
>> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
>> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
>> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
>> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> --
>> Sent via the guest posting facility at bioconductor.org.
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
>
> --
> Valerie Obenchain
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B155
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: vobencha at fhcrc.org
> Phone: (206) 667-3158
> Fax: (206) 667-1319
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
More information about the Bioconductor
mailing list