[BioC] get promoter regions (bed format) of genes hg19
Valerie Obenchain
vobencha at fhcrc.org
Wed Jan 22 17:56:25 CET 2014
Hi Ninni,
A more complete description of the promoters() function is on the man
page in the GenomicRanges package. You can find this by specifying the
full method name:
?'promoters,GenomicRanges-method'
The GRanges or GRangesList you supply as 'x' is assumed to contain gene
ranges. Here is a snip from the man page stating that the assumption is
that start(x) is the TSS:
> • ‘promoters’ returns an object of the same type and length as
> ‘x’ containing promoter ranges. Promoter ranges extend around
> the transcription start site (TSS) which is defined as
> ‘start(x)’. The ‘upsteam’ and ‘downstream’ arguments define
> the number of nucleotides in the 5' and 3' direction,
> respectively. The full range is defined as,
>
> (start(x) - upstream) to (start(x) + downstream - 1).
So promoters() simply returns the region upstream and/or downstream of
the range you supply as 'x'. If you have a generic range and want to
determine if it falls within a gene you can use findOverlaps() on a
TranscriptDb and your ranges.
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
genes <- transcriptsBy(txdb, "gene")
hits <- findOverlaps(htr_genes_gr, genes)
'queryHits' is an index of 'htr_genes_gr' ranges that maps to the
'subjectHits' index of 'genes' ranges. Here we see that both range 2 and
4 in 'htr_genes_gr' each map to 2 different genes.
>> hits
> Hits of length 6
> queryLength: 4
> subjectLength: 23459
> queryHits subjectHits
> <integer> <integer>
> 1 1 19185
> 2 2 17448
> 3 2 20105
> 4 3 17448
> 5 4 5026
> 6 4 8270
Isolate the genes that the ranges in 'htr_genes_gr' hit:
mygenes <- genes[subjectHits(hits)]
Call promoters() on the gene ranges of interest.
promoters(mygenes, upstream=2000)
Valerie
On 01/22/2014 01:12 AM, Ninni Nahm [guest] wrote:
>
> Dear all,
>
> I'm trying to obtain promoter regions of a list of genes.
> This is my granges example:
>
> htr_genes_gr = GRanges( seqnames = rep("chr1", 4),
> ranges = IRanges(start = c(131025, 761586, 762988, 860260), end = c(134836, 794826, 794826, 879955)))
>
> I wanted to use promoters(htr_genes_gr, upstream=2000)
> to get the upstream regions. However, I get something, but Im afraid that it does not match with the USCS genome browser. So, I take the output and view it in IGV with hg19 loaded. But the promoter seems shifted, i.e. it does not start where the gene begins. When I look at the IRanges web page, there is an indication that the genome version might be hg18. I am not sure if this is the problem. Anyways, does anyone have a suggestion how I can get the bed file of the promoter regions. I need the promoter regions of all genes in the end.
> Thanks a lot!
>
>
>
> -- output of sessionInfo():
>
> sessionInfo()
> R version 3.0.2 (2013-09-25)
> Platform: x86_64-pc-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> --
> Sent via the guest posting facility at bioconductor.org.
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
--
Valerie Obenchain
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B155
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: vobencha at fhcrc.org
Phone: (206) 667-3158
Fax: (206) 667-1319
More information about the Bioconductor
mailing list