[BioC] Human genomic sequences

Vincent Carey stvjc at channing.harvard.edu
Mon Oct 18 23:26:51 CEST 2010


On Fri, Oct 15, 2010 at 2:48 PM, Sheena Scroggins
<sheena.scroggins at gmail.com> wrote:
> Hi,
>
>
> I'm working on a project that is reviewing genes in the human genome
> project. I'm hoping you can point me to some BioConductor packages that will
> help me in my quest. The goal is:
>
>
>   - Search through hg19 at UCSC (or other genome database) for particular
>   genes of interest.

org.Hs.eg.db can be used.  If you are interested in, say, FLT1, you can
look up its Entrez Gene ID and then its chromosomal location via

> library(org.Hs.eg.db)
> get("FLT1", revmap(org.Hs.egSYMBOL))
[1] "2321"
> get("2321", org.Hs.egCHRLOC)
       13        13        13        13
-28973181 -28959687 -28942233 -28874482
> get("2321", org.Hs.egCHRLOCEND)
       13        13        13        13
-29069265 -29069265 -29069265 -29069265

Multiple addresses per gene are common, and you will need rules to resolve
them.  The negative coordinates indicate that the gene lies on the minus
strand.  If you wish to work at the transcript level, the GenomicFeatures
package is relevant; see its makeTranscriptDbFromUCSC function.
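
One possible resolution rule, sketched in base R: collapse the multiple
addresses into the outermost span.  The function name outer_span is
hypothetical, the input values are copied from the FLT1 output above, and
taking the outermost span is only one assumption among several reasonable
rules (you might prefer the longest transcript, for instance).

```r
# Values copied from the CHRLOC / CHRLOCEND output above.
# Negative coordinates denote the minus strand.
starts <- c("13" = -28973181, "13" = -28959687,
            "13" = -28942233, "13" = -28874482)
ends   <- c("13" = -29069265, "13" = -29069265,
            "13" = -29069265, "13" = -29069265)

# Hypothetical helper: reduce several start/end pairs on one chromosome
# to a single interval covering all of them.
outer_span <- function(starts, ends) {
  stopifnot(length(starts) == length(ends),
            length(unique(names(starts))) == 1)
  coords <- abs(c(starts, ends))
  strand <- if (all(starts < 0)) {
    "-"
  } else if (all(starts > 0)) {
    "+"
  } else {
    "*"   # mixed signs: strand cannot be decided by this rule
  }
  data.frame(chr = names(starts)[1],
             start = min(coords),
             end = max(coords),
             strand = strand,
             stringsAsFactors = FALSE)
}

flt1 <- outer_span(starts, ends)
flt1
```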

>   - Increase the search by looking both upstream and downstream until
>   another gene is hit

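This is a programming task: given the gene coordinates, you scan outward in
each direction until you reach the nearest neighboring gene.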
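
A minimal base R sketch of that search, under stated assumptions: the gene
table, its column names, and the function flanking_regions are all made up
for illustration, and genes that overlap the gene of interest are simply
ignored.  In practice the table would come from an annotation source such
as the ones mentioned above.

```r
# Toy gene table; every symbol and coordinate here is invented.
genes <- data.frame(
  symbol = c("geneA", "geneB", "geneC", "geneD"),
  chr    = c("chr13", "chr13", "chr13", "chr13"),
  start  = c(1000, 5000, 9000, 15000),
  end    = c(2000, 6000, 11000, 16000),
  strand = c("+", "-", "+", "+"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the intergenic gap on each side of a gene,
# extending until the nearest neighboring gene on the same chromosome.
flanking_regions <- function(genes, symbol) {
  g <- genes[genes$symbol == symbol, ]
  stopifnot(nrow(g) == 1)
  others <- genes[genes$chr == g$chr & genes$symbol != symbol, ]
  left_ends    <- others$end[others$end < g$start]
  right_starts <- others$start[others$start > g$end]
  # NA on a side means no neighboring gene was found there.
  left <- if (length(left_ends) > 0) {
    c(max(left_ends) + 1, g$start - 1)
  } else {
    c(NA_real_, NA_real_)
  }
  right <- if (length(right_starts) > 0) {
    c(g$end + 1, min(right_starts) - 1)
  } else {
    c(NA_real_, NA_real_)
  }
  # Upstream is the low-coordinate side on "+", the high one on "-".
  if (g$strand == "-") {
    list(upstream = right, downstream = left)
  } else {
    list(upstream = left, downstream = right)
  }
}

fl <- flanking_regions(genes, "geneB")
fl
```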

>   - Be able to sort by the conservation of the upstream and downstream
>   data.

You can use rtracklayer to import conservation scores into R.  Example:

> library(rtracklayer)
> s2 = browserSession("UCSC")
> ct = track(s2, "cons46way")
> ct
UCSC track 'Primate Cons'
UCSCData with 9974 rows and 1 value column across 1 space
           space               ranges   |     score
     <character>            <IRanges>   | <numeric>
1          chr21 [33031597, 33031597]   |  0.560087
2          chr21 [33031598, 33031598]   |  0.560087
3          chr21 [33031599, 33031599]   |  0.435717
4          chr21 [33031600, 33031600]   |  0.435717
5          chr21 [33031601, 33031601]   |  0.560087
6          chr21 [33031602, 33031602]   |  0.560087
7          chr21 [33031603, 33031603]   |  0.435717
8          chr21 [33031604, 33031604]   |  0.435717
9          chr21 [33031605, 33031605]   |  0.560087
...          ...                  ... ...       ...
9966       chr21 [33041562, 33041562]   | -0.266457
9967       chr21 [33041563, 33041563]   |  0.433850
9968       chr21 [33041564, 33041564]   |  0.507567
9969       chr21 [33041565, 33041565]   |  0.655000
9970       chr21 [33041566, 33041566]   | -0.155882
9971       chr21 [33041567, 33041567]   | -0.340173
9972       chr21 [33041568, 33041568]   |  0.507567
9973       chr21 [33041569, 33041569]   | -1.740790
9974       chr21 [33041570, 33041570]   | -1.925080

> length(ct$score)
[1] 9974
> summary(ct$score)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-4.62200 -0.33040  0.36470  0.03753  0.50520  0.65500

What is returned at this point depends on the state of the browser session,
which you can set manually or programmatically; see the rtracklayer
vignette.
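
Once per-base scores are in R, sorting regions by conservation can be
sketched as below.  This is base R on invented numbers: the scores and
regions data frames and their column names are assumptions shaped like the
track output above, and the mean score is only one possible summary.

```r
# Invented per-base scores, one row per position.
scores <- data.frame(
  pos   = 101:110,
  score = c(0.5, 0.5, 0.4, 0.4, -1.0, -1.0, 0.6, 0.6, 0.6, 0.6)
)

# Invented regions to rank, e.g. the flanks of a gene of interest.
regions <- data.frame(
  name  = c("upstream", "downstream"),
  start = c(101, 107),
  end   = c(106, 110),
  stringsAsFactors = FALSE
)

# Mean score of the positions falling inside each region;
# NA when a region contains no scored position.
regions$mean_score <- vapply(seq_len(nrow(regions)), function(i) {
  inside <- scores$pos >= regions$start[i] & scores$pos <= regions$end[i]
  if (any(inside)) mean(scores$score[inside]) else NA_real_
}, numeric(1))

# Most conserved region first; regions without scores go last.
ranked <- regions[order(regions$mean_score, decreasing = TRUE,
                        na.last = TRUE), ]
ranked
```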

The biomaRt package is also relevant.




>
> We are essentially looking for unknown promoters and other key pieces of the
> DNA that is not in the gene itself, but is conserved through the different
> Mammal genomes that are available at the different browsers. I found a
> program that does this (kind of ) with BLAST data, so I'm hoping you can
> point me to any useful package or other places I can search for a way to do
> this.
>
> Thanks for your time,
>
> Sheena