[BioC] bed file around the TSSs

Wed Oct 13 23:05:31 CEST 2010

There are various approaches but basic tools for one solution are
1) use GenomicFeatures::makeTranscriptTableFromUCSC(genome, table) to acquire
a SQLite database that has relevant information
2) extract the GRanges object corresponding to transcripts and their
genomic coordinates
using the transcripts() method
3) set the names of the GRanges using something like names(gr) =
elementMetadata(gr)$tx_name -- but see below; you might instead coerce
to RangedData and just change the column names
4) revise the coordinate information according to your wishes
5) use the rtracklayer export method on the revised GRanges or
RangedData structure to create the bed file

some concrete results, based on

> sessionInfo()
R version 2.13.0 Under development (unstable) (2010-10-12 r53293)
Platform: x86_64-apple-darwin10.4.0/x86_64 (64-bit)

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices datasets  tools     utils     methods
[8] base

other attached packages:
[1] rtracklayer_1.9.9      RCurl_1.4-3            bitops_1.0-4.1
[4] GenomicFeatures_1.1.12 GenomicRanges_1.1.29   IRanges_1.7.35
[7] weaver_1.15.0          codetools_0.2-2        digest_0.4.2

loaded via a namespace (and not attached):
[1] BSgenome_1.17.7    Biobase_2.9.2      Biostrings_2.17.47 DBI_0.2-5
[5] RSQLite_0.9-2      XML_3.1-1          biomaRt_2.5.1

> t1 = makeTranscriptDbFromUCSC("hg19", "refGene")
Download the refGene table ... OK
Download the refLink table ... OK
Extract the 'transcripts' data frame ... OK
Extract the 'splicings' data frame ... OK
Download and preprocess the 'chrominfo' data frame ... OK
Prepare the 'metadata' data frame ... OK
Make the TranscriptDb object ... OK
There were 50 or more warnings (use warnings() to see the first 50)
> tt1 = transcripts(t1)
> tt1
GRanges with 37195 ranges and 2 elementMetadata values
        seqnames               ranges strand   |     tx_id      tx_name
           <Rle>            <IRanges>  <Rle>   | <integer>  <character>
    [1]     chr1     [ 69091,  70008]      +   |       980 NM_001005484
    [2]     chr1     [323892, 328581]      +   |       985    NR_028327
    [3]     chr1     [323892, 328581]      +   |       986    NR_028322
    [4]     chr1     [323892, 328581]      +   |       987    NR_028325
    [5]     chr1     [367659, 368597]      +   |       984 NM_001005277

> names(tt1) = make.names(elementMetadata(tt1)$tx_name, unique=TRUE)  # why?
> tt1[1:5]
GRanges with 5 ranges and 2 elementMetadata values
             seqnames           ranges strand |     tx_id      tx_name
                <Rle>        <IRanges>  <Rle> | <integer>  <character>
NM_001005484     chr1 [ 69091,  70008]      + |       980 NM_001005484
   NR_028327     chr1 [323892, 328581]      + |       985    NR_028327
   NR_028322     chr1 [323892, 328581]      + |       986    NR_028322
   NR_028325     chr1 [323892, 328581]      + |       987    NR_028325
NM_001005277     chr1 [367659, 368597]      + |       984 NM_001005277

seqlengths
                  chr1                  chr2 ... chr18_gl000207_random
             249250621             243199373 ...                  4262

observe what happens when the tx_names are duplicated

> tt1[42:44]
GRanges with 3 ranges and 2 elementMetadata values
            seqnames             ranges strand |     tx_id     tx_name
               <Rle>          <IRanges>  <Rle> | <integer> <character>
  NR_002946     chr1 [1568159, 1570027]      + |      1053   NR_002946
NR_002946.1     chr1 [1631378, 1633247]      + |      1063   NR_002946
  NM_138705     chr1 [1846266, 1848733]      + |      1066   NM_138705

seqlengths
                  chr1                  chr2 ... chr18_gl000207_random
             249250621             243199373 ...                  4262

if you coerce to RangedData and change tx_name to name you can
probably avoid the munging of the transcript names,
which in this case, if you use the approach above, could be pretty
confusing if .1 etc are used to indicate versions at refseq

> export(tt1, "tt1.bed")
> cat(readLines("tt1.bed", n=5), sep="\n")
chr1    69090   70008   NM_001005484    0       +
chr1    323891  328581  NR_028327       0       +
chr1    323891  328581  NR_028322       0       +
chr1    323891  328581  NR_028325       0       +
chr1    367658  368597  NM_001005277    0       +

i have skipped the modifications to the coordinates (you will need to program
that conditional on strand, presumably) and the setting of the header line.

On Wed, Oct 13, 2010 at 3:53 PM, joseph <jdsandjd at yahoo.com> wrote:
> Hi
> I would highly appreciate if you could show me how to create a bed file around
> the TSSs from UCSC databases such as ensembl or refSeq genes. I need 350
> nucleotides upstream and 150 nucleotides downstream of TSSs. The bed file should
> look like below, where:
> chromStart is 350 nucleotides upstream of TSS
> chromEnd is 150 nucleotides downstream of TSS
> name is Name of gene or transcript_id depending on the database.
>
>
>
> chrom    chromStart    chromEnd    name        score    strand
> chr1    67051159    67163158    NM_024763    0    -
> chr1    67075869    67163158    NM_207014    0    +
> chr1    16762998    16812569    NM_017940    0    -
>
> Thanks
> Joseph
>
>
>
>
>        [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>