[BioC] converting position from '-' strand to '+' strand
Hervé Pagès
hpages at fhcrc.org
Mon Apr 5 21:57:37 CEST 2010
Hi Tim,
Tim Smith wrote:
> Apologies if this seems like a trivial question.
>
> I wanted to have a consistent set of locations and wanted to all the locations to begin from the 5' end. How can I convert locations that are given for the '-' strand?
> For example:
> -----------------------------
> library(biomaRt)
>
> mart.obj <- useMart(biomart = 'ensembl', dataset = 'hsapiens_gene_ensembl')
>
> atb <- c('ensembl_gene_id', 'chromosome_name', 'start_position', 'end_position', 'strand')
>
> mir.locs <- getBM(attributes=atb, filters="biotype", values="miRNA", mart=mart.obj)
> mir.locs[1:5,]
>> ensembl_gene_id chromosome_name start_position end_position strand
> 1 ENSG00000222732 5 171706206 171706319 1
> 2 ENSG00000207864 9 97847727 97847823 1
> 3 ENSG00000221173 9 129338809 129338909 -1
> 4 ENSG00000222961 5 32379501 32379581 -1
> 5 ENSG00000221058 18 51612956 51613026 -1
>
>
> ----------------------------
> Is there a quick way that I can convert the last 3 rows so that they reflect positions from the 5' strand?
> many thanks!
Note that Ensembl here is using the widely adopted convention (at
UCSC, Ensembl, in GFF files, in Bioconductor, etc...) to report
feature coordinates relatively to the 5' end of the plus strand
(and to call "start position" the position of their leftmost base)
even for features that belong to the minus strand. There are a lot
of advantages to keep that representation all the way down your
analysis.
Anyway, if you really want to do that conversion, then you need to
know the chromosome lengths for the current Hsapiens genome at
Ensembl. You can easily get them with the devel version of the
GenomicFeatures package (you need R-2.11 + BioC devel):
Make a TranscriptDb object from the 'hsapiens_gene_ensembl' dataset:
library(GenomicFeatures)
txdb <- makeTranscriptDbFromBiomart()
Then:
> seqlengths(txdb)[1:6]
5 19 10 4 8 20
180915260 59128983 135534747 191154276 146364022 63025520
Then use something like (untested):
idx <- which(mir.locs$strand == -1)
idx2seqlengths <- seqlengths(txdb)[mir.locs$chromosome_name[idx]]
tmp <- mir.locs$start_position[idx]
mir.locs$start_position[idx] <-
idx2seqlengths - mir.locs$end_position[idx] + 1L
mir.locs$end_position[idx] <- idx2seqlengths - tmp + 1L
Cheers,
H.
>
>
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list