[BioC] Genome position to miRNA or gene name
Martin Morgan
mtmorgan at fhcrc.org
Thu Jan 22 06:27:39 CET 2009
Steve Lianoglou <mailinglist.honeypot at gmail.com> writes:
> Hi,
>
>> I have got a set of human genome locations that I have found using
>> Biostrings and BSGenome alignment e.g.
>>
>> seqname start end strand patternID
>> chr9 95978065 95978085 + TGAGGTAGTAGGTTGTATAGT
>> chr11 121522487 121522507 - TGAGGTAGTAGGTTGTATAGT
>> chr22 44887296 44887316 + TGAGGTAGTAGGTTGTATAGT
>> chr22 44888235 44888256 + TGAGGTAGTAGGTTGTGTGGTT
>>
>> What I would like to know is whether this genome location is within a
>> known miRNA or gene. What would the best way be to go about this?
>
>
> One way could be to grab the appropriate GTF file for Hsapiens here:
>
> ftp://ftp.ensembl.org/pub/current_gtf/
>
> It's just a tab delimited file with genome annotations. You can just
> collect the lines for miRNA annotations and see if your positions fall
> in the bounds of the known/annotated miRNA's.
>
> For example, here's one:
>
> 18 miRNA exon 38162 38272 . + . gene_id
> "ENSG00000221441" ...
>
> You can also get miRNA data from miRBase
> (http://microrna.sanger.ac.uk/sequences/ ) perhaps it's a more
> complete set of data?). The data file I have from them doesn't have
> chromosome positions, but does have the stem-
> loop sequence, so that would require an intermediate alignment step
> before getting at what you're after.
>
> Probably not the best way, but a way nonetheless. If you're
> comfortable using biomaRt, I'm sure there's a way to pull down the
> same annotation info and do the comparison from there.
>
> Sorry ... no R code, but hopefully it's helpful nonetheless :-)
Very roughly, this
fl <- "ftp://ftp.sanger.ac.uk/pub/mirbase/sequences/CURRENT/genomes/hsa.gff"
seems to be the Human miRNA data base from the Sanger. It parses in to
an R data frame with
gff <- read.table(fl)
(hey, that's cool -- pulling the data directly from ftp!). The fourth
and fifth columns are chromosomal positions; there is also information
on chromosome (column 1) and strand (column 7).
My strategy would be to loop over your aligned sequences by chromosome
and strand, and for each subset construct an IRanges object (from the
IRanges package) that contains the start and end position of all
sequenes. Suppose we have the 'start' and 'end' of each sequence on
chromosome 1
seqs <- IRanges(start, end)
and the same for the gff data
miRNAs <- with(gff[gff$V1 == "1" & gff$V7 == "+",], IRanges(V4, V5))
Then use 'overlap' from IRanges. Along the lines of
overlap(miRNAs, seqs, multiple=FALSE)
A little messy to keep track of everything, but should be
computationally quite efficient, even for very large numbers of
sequences. I think also that import.gff (in rtracklayer) and the
alignment data structures in Biostrings are all based on the same
classes as the IRanges, and that with a little bit of cleverness the
software might do all the book-keeping for you.
Martin
> -steve
>
> --
> Steve Lianoglou
> Graduate Student: Physiology, Biophysics and Systems Biology
> Weill Medical College of Cornell University
>
> http://cbio.mskcc.org/~lianos
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M2 B169
Phone: (206) 667-2793
More information about the Bioconductor
mailing list