[BioC] bed file around the TSSs
Vincent Carey
stvjc at channing.harvard.edu
Wed Oct 13 23:05:31 CEST 2010
There are various approaches but basic tools for one solution are
1) use GenomicFeatures::makeTranscriptTableFromUCSC(genome, table) to acquire
a SQLite database that has relevant information
2) extract the GRanges object corresponding to transcripts and their
genomic coordinates
using the transcripts() method
3) set the names of the GRanges using something like names(gr) =
elementMetadata(gr)$tx_name -- but see below; you might instead coerce
to RangedData and just change the column names
4) revise the coordinate information according to your wishes
5) use the rtracklayer export method on the revised GRanges or
RangedData structure to create the bed file
some concrete results, based on
> sessionInfo()
R version 2.13.0 Under development (unstable) (2010-10-12 r53293)
Platform: x86_64-apple-darwin10.4.0/x86_64 (64-bit)
locale:
[1] C
attached base packages:
[1] stats graphics grDevices datasets tools utils methods
[8] base
other attached packages:
[1] rtracklayer_1.9.9 RCurl_1.4-3 bitops_1.0-4.1
[4] GenomicFeatures_1.1.12 GenomicRanges_1.1.29 IRanges_1.7.35
[7] weaver_1.15.0 codetools_0.2-2 digest_0.4.2
loaded via a namespace (and not attached):
[1] BSgenome_1.17.7 Biobase_2.9.2 Biostrings_2.17.47 DBI_0.2-5
[5] RSQLite_0.9-2 XML_3.1-1 biomaRt_2.5.1
> t1 = makeTranscriptDbFromUCSC("hg19", "refGene")
Download the refGene table ... OK
Download the refLink table ... OK
Extract the 'transcripts' data frame ... OK
Extract the 'splicings' data frame ... OK
Download and preprocess the 'chrominfo' data frame ... OK
Prepare the 'metadata' data frame ... OK
Make the TranscriptDb object ... OK
There were 50 or more warnings (use warnings() to see the first 50)
> tt1 = transcripts(t1)
> tt1
GRanges with 37195 ranges and 2 elementMetadata values
seqnames ranges strand | tx_id tx_name
<Rle> <IRanges> <Rle> | <integer> <character>
[1] chr1 [ 69091, 70008] + | 980 NM_001005484
[2] chr1 [323892, 328581] + | 985 NR_028327
[3] chr1 [323892, 328581] + | 986 NR_028322
[4] chr1 [323892, 328581] + | 987 NR_028325
[5] chr1 [367659, 368597] + | 984 NM_001005277
> names(tt1) = make.names(elementMetadata(tt1)$tx_name, unique=TRUE) # why?
> tt1[1:5]
GRanges with 5 ranges and 2 elementMetadata values
seqnames ranges strand | tx_id tx_name
<Rle> <IRanges> <Rle> | <integer> <character>
NM_001005484 chr1 [ 69091, 70008] + | 980 NM_001005484
NR_028327 chr1 [323892, 328581] + | 985 NR_028327
NR_028322 chr1 [323892, 328581] + | 986 NR_028322
NR_028325 chr1 [323892, 328581] + | 987 NR_028325
NM_001005277 chr1 [367659, 368597] + | 984 NM_001005277
seqlengths
chr1 chr2 ... chr18_gl000207_random
249250621 243199373 ... 4262
observe what happens when the tx_names are duplicated
> tt1[42:44]
GRanges with 3 ranges and 2 elementMetadata values
seqnames ranges strand | tx_id tx_name
<Rle> <IRanges> <Rle> | <integer> <character>
NR_002946 chr1 [1568159, 1570027] + | 1053 NR_002946
NR_002946.1 chr1 [1631378, 1633247] + | 1063 NR_002946
NM_138705 chr1 [1846266, 1848733] + | 1066 NM_138705
seqlengths
chr1 chr2 ... chr18_gl000207_random
249250621 243199373 ... 4262
if you coerce to RangedData and change tx_name to name you can
probably avoid the munging of the transcript names,
which in this case, if you use the approach above, could be pretty
confusing if .1 etc are used to indicate versions at refseq
> export(tt1, "tt1.bed")
> cat(readLines("tt1.bed", n=5), sep="\n")
chr1 69090 70008 NM_001005484 0 +
chr1 323891 328581 NR_028327 0 +
chr1 323891 328581 NR_028322 0 +
chr1 323891 328581 NR_028325 0 +
chr1 367658 368597 NM_001005277 0 +
i have skipped the modifications to the coordinates (you will need to program
that conditional on strand, presumably) and the setting of the header line.
On Wed, Oct 13, 2010 at 3:53 PM, joseph <jdsandjd at yahoo.com> wrote:
> Hi
> I would highly appreciate if you could show me how to create a bed file around
> the TSSs from UCSC databases such as ensembl or refSeq genes. I need 350
> nucleotides upstream and 150 nucleotides downstream of TSSs. The bed file should
> look like below, where:
> chromStart is 350 nucleotides upstream of TSS
> chromEnd is 150 nucleotides downstream of TSS
> name is Name of gene or transcript_id depending on the database.
>
>
>
> chrom chromStart chromEnd name score strand
> chr1 67051159 67163158 NM_024763 0 -
> chr1 67075869 67163158 NM_207014 0 +
> chr1 16762998 16812569 NM_017940 0 -
>
> Thanks
> Joseph
>
>
>
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
More information about the Bioconductor
mailing list