[BioC] bed file around the TSSs
Vincent Carey
stvjc at channing.harvard.edu
Sat Oct 16 23:49:32 CEST 2010
as usual my suggestions were a bit more involved than necessary. the
point is that the thing you get from the
transcripts() method applied to a makeTranscriptsDb will have
"tx_name" as the name of the elementMetadata and
this metadata will not be propagated by export(); if it had name
"name", it would be propagated. so
> f19 = loadFeatures("hg19.txdb.sqlite") # serialized via saveFeatures, but perhaps a long time ago... so the message:
The TranscriptDb object has been updated to the latest schema version 1.0.
The updated object can be saved using saveFeatures()
> t19 = transcripts(f19) # get transcripts GRanges
> names(elementMetadata(t19)) # what are the metadata names
[1] "tx_id" "tx_name"
> names(elementMetadata(t19))[2] = "name" # reset a name
> export(t19, "t19.bed") # do not use file=
> cat(readLines("t19.bed", n=5), sep="\n") # what you asked for, minus the header line
chr1 11873 14409 uc001aaa.3 0 +
chr1 11873 14409 uc010nxq.1 0 +
chr1 11873 14409 uc010nxr.1 0 +
chr1 69090 70008 uc001aal.1 0 +
chr1 321083 321114 uc001aaq.1 0 +
On Sat, Oct 16, 2010 at 5:10 PM, joseph <jdsandjd at yahoo.com> wrote:
> Hi Vincent
> Thank you, that was very helpful.
> Can you please expand a bit more on step#3 about coercing to RangedData and
> changing tx_name to name?
> Joseph
>
> ________________________________
> From: Vincent Carey <stvjc at channing.harvard.edu>
> To: joseph <jdsandjd at yahoo.com>
> Cc: bioconductor at stat.math.ethz.ch
> Sent: Wed, October 13, 2010 2:05:31 PM
> Subject: Re: [BioC] bed file around the TSSs
>
> There are various approaches but basic tools for one solution are
> 1) use GenomicFeatures::makeTranscriptTableFromUCSC(genome, table) to
> acquire
> a SQLite database that has relevant information
> 2) extract the GRanges object corresponding to transcripts and their
> genomic coordinates
> using the transcripts() method
> 3) set the names of the GRanges using something like names(gr) =
> elementMetadata(gr)$tx_name -- but see below; you might instead coerce
> to RangedData and just change the column names
> 4) revise the coordinate information according to your wishes
> 5) use the rtracklayer export method on the revised GRanges or
> RangedData structure to create the bed file
>
> some concrete results, based on
>
>> sessionInfo()
> R version 2.13.0 Under development (unstable) (2010-10-12 r53293)
> Platform: x86_64-apple-darwin10.4.0/x86_64 (64-bit)
>
> locale:
> [1] C
>
> attached base packages:
> [1] stats graphics grDevices datasets tools utils methods
> [8] base
>
> other attached packages:
> [1] rtracklayer_1.9.9 RCurl_1.4-3 bitops_1.0-4.1
> [4] GenomicFeatures_1.1.12 GenomicRanges_1.1.29 IRanges_1.7.35
> [7] weaver_1.15.0 codetools_0.2-2 digest_0.4.2
>
> loaded via a namespace (and not attached):
> [1] BSgenome_1.17.7 Biobase_2.9.2 Biostrings_2.17.47 DBI_0.2-5
> [5] RSQLite_0.9-2 XML_3.1-1 biomaRt_2.5.1
>
>> t1 = makeTranscriptDbFromUCSC("hg19", "refGene")
> Download the refGene table ... OK
> Download the refLink table ... OK
> Extract the 'transcripts' data frame ... OK
> Extract the 'splicings' data frame ... OK
> Download and preprocess the 'chrominfo' data frame ... OK
> Prepare the 'metadata' data frame ... OK
> Make the TranscriptDb object ... OK
> There were 50 or more warnings (use warnings() to see the first 50)
>> tt1 = transcripts(t1)
>> tt1
> GRanges with 37195 ranges and 2 elementMetadata values
> seqnames ranges strand | tx_id tx_name
> <Rle> <IRanges> <Rle> | <integer> <character>
> [1] chr1 [ 69091, 70008] + | 980 NM_001005484
> [2] chr1 [323892, 328581] + | 985 NR_028327
> [3] chr1 [323892, 328581] + | 986 NR_028322
> [4] chr1 [323892, 328581] + | 987 NR_028325
> [5] chr1 [367659, 368597] + | 984 NM_001005277
>
>> names(tt1) = make.names(elementMetadata(tt1)$tx_name, unique=TRUE) # why?
>> tt1[1:5]
> GRanges with 5 ranges and 2 elementMetadata values
> seqnames ranges strand | tx_id tx_name
> <Rle> <IRanges> <Rle> | <integer> <character>
> NM_001005484 chr1 [ 69091, 70008] + | 980 NM_001005484
> NR_028327 chr1 [323892, 328581] + | 985 NR_028327
> NR_028322 chr1 [323892, 328581] + | 986 NR_028322
> NR_028325 chr1 [323892, 328581] + | 987 NR_028325
> NM_001005277 chr1 [367659, 368597] + | 984 NM_001005277
>
> seqlengths
> chr1 chr2 ... chr18_gl000207_random
> 249250621 243199373 ... 4262
>
> observe what happens when the tx_names are duplicated
>
>> tt1[42:44]
> GRanges with 3 ranges and 2 elementMetadata values
> seqnames ranges strand | tx_id tx_name
> <Rle> <IRanges> <Rle> | <integer> <character>
> NR_002946 chr1 [1568159, 1570027] + | 1053 NR_002946
> NR_002946.1 chr1 [1631378, 1633247] + | 1063 NR_002946
> NM_138705 chr1 [1846266, 1848733] + | 1066 NM_138705
>
> seqlengths
> chr1 chr2 ... chr18_gl000207_random
> 249250621 243199373 ... 4262
>
> if you coerce to RangedData and change tx_name to name you can
> probably avoid the munging of the transcript names,
> which in this case, if you use the approach above, could be pretty
> confusing if .1 etc are used to indicate versions at refseq
>
>> export(tt1, "tt1.bed")
>> cat(readLines("tt1.bed", n=5), sep="\n")
> chr1 69090 70008 NM_001005484 0 +
> chr1 323891 328581 NR_028327 0 +
> chr1 323891 328581 NR_028322 0 +
> chr1 323891 328581 NR_028325 0 +
> chr1 367658 368597 NM_001005277 0 +
>
> i have skipped the modifications to the coordinates (you will need to
> program
> that conditional on strand, presumably) and the setting of the header line.
>
>
>
>
> On Wed, Oct 13, 2010 at 3:53 PM, joseph <jdsandjd at yahoo.com> wrote:
>> Hi
>> I would highly appreciate if you could show me how to create a bed file
>> around
>> the TSSs from UCSC databases such as ensembl or refSeq genes. I need 350
>> nucleotides upstream and 150 nucleotides downstream of TSSs. The bed file
>> should
>> look like below, where:
>> chromStart is 350 nucleotides upstream of TSS
>> chromEnd is 150 nucleotides downstream of TSS
>> name is Name of gene or transcript_id depending on the database.
>>
>>
>>
>> chrom chromStart chromEnd name score strand
>> chr1 67051159 67163158 NM_024763 0 -
>> chr1 67075869 67163158 NM_207014 0 +
>> chr1 16762998 16812569 NM_017940 0 -
>>
>> Thanks
>> Joseph
>>
>>
>>
>>
>> [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
>
More information about the Bioconductor
mailing list