[BioC] biomaRt - batch query for chromosome location to gene identifier?

Tue Nov 16 03:55:49 CET 2010

Dear Vincent,

your solution works (almost) perfectly (see below ;-) ). In fact, I don't need the step using R2.10.1, because with the UCSC IDs from your solution, I can retrieve the Entrez gene IDs via biomaRt, or do my analysis with the other identifier. Thank you again!

Dear Martin,

I am not quite sure, about the advantage of your suggestion? Besides of the cleaner syntax (and probably higher efficiency?) the output looks "promising" as it seems, that the it tells me how all of my "query" ranges (= locations) are assigned to a "subject" range in a 1:n way. That would be very nice, but I don't get the content of the "subject" column? I should be a UCSC ID, but I couldn't figure out which (obviously, not the transcript ID)? 

> mygene.overlap = as.matrix(findOverlaps(mygene.granges,tx18))
> mygene.overlap
       query subject
  [1,]     1    6242
  [2,]     2     821
  [3,]     2     822
  [4,]     2     823
......
[429,]   160   64665
[430,]   160   64666
[431,]   161   64793
[432,]   161   64794

Related to the question above (and the only point that could be "smoothened" ;-) ): is there a way to have my cluster names returned in the results, too, so that I could at a first glimpse see which cluster got which annotation? I tried two positions for my cluster names in the GRanges object: 1) hijacked the metadata column, 2) discovered the name option. But I did not find a possibility to return them side by side with the results in the GenomicFeatures/Ranges, and IRanges vignettes.

Best regards,
Kemal

Structure of data table "data.table" imported into R (I truncated it and that of the output below!):

clusterID chromosome strand cluster_begin cluster_end
seq1_chr1 chr1 - 103949843 103949866
seq2_chr1 chr1 + 19566219 19566244
seq3_chr1 chr1 - 223341511 223341535
seq4_chr1 chr1 - 3341179  3341202
...
seq161_chr20 chr20 + 2399999 2400010

R session (truncated to save space)

> library("GenomicFeatures")
Loading required package: IRanges

Attaching package: 'IRanges'

The following object(s) are masked from 'package:base':

    cbind, eval, Map, mapply, order, paste, pmax, pmax.int, pmin, pmin.int, rbind, rep.int, table

Loading required package: GenomicRanges
> library("biomaRt")
> hg18.txdb = loadFeatures("path/to/transcripts/hg18.txdb.sqlite")
> tx18 = transcripts(hg18.txdb)
> mygene.import = read.table("/path/to/my/table/data.table.txt", sep = "", h = TRUE, stringsAsFactors = FALSE)
> mygene.granges = with(mygene.import, GRanges(ranges = IRanges(cluster_begin, cluster_end), strand = Rle(factor(strand)), seqnames = Rle(factor(chromosome)), elementMetadata = clusterID))
> mygene.granges

GRanges with 161 ranges and 1 elementMetadata value
      seqnames                 ranges strand   | elementMetadata
         <Rle>              <IRanges>  <Rle>   |     <character>
  [1]     chr1 [103949843, 103949866]      -   |    seq1_chr1
  [2]     chr1 [ 19566219,  19566244]      +   |     seq2_chr1
  [3]     chr1 [223341511, 233341535]      -   |    seq3_chr1
  [4]     chr1 [3341179,     3341202]      -   |    seq4_chr1
  ...      ...                    ...    ... ...             ...
[161]     chr20 [2399999, 2400010]          +   |    seq161_chr20

seqlengths
  chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr19  chr2 chr20 chr21 chr22  chr3  chr4  chr5  chr6  chr7  chr8  chr9  chrX
    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA

>mygene.granges2 = with(mygene.import, GRanges(ranges = IRanges(cluster_begin, cluster_end, names = clusterID), strand = Rle(factor(strand)), seqnames = Rle(factor(chromosome))))
> mygene.granges2
GRanges with 161 ranges and 0 elementMetadata values
             seqnames                 ranges strand   |
                <Rle>              <IRanges>  <Rle>   |
seq1_chr1     chr1 [103949843, 103949866]      -   |
seq2_chr1     chr1 [ 19566219,  19566244]      +   |
seq3_chr1     chr1 [223341511, 233341535]      -   |
seq4_chr1     chr1 [3341179,     3341202]      -   |
         ...      ...                    ...    ... ...
seq161_chr20    chr20 [2399999, 2400010]      +   |

seqlengths
  chr1 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr19  chr2 chr20 chr21 chr22  chr3  chr4  chr5  chr6  chr7  chr8  chr9  chrX
    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
> sessionInfo()
R version 2.12.0 (2010-10-15)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] biomaRt_2.6.0         GenomicFeatures_1.2.1 GenomicRanges_1.2.1   IRanges_1.8.2        

loaded via a namespace (and not attached):
[1] Biobase_2.10.0     Biostrings_2.18.0  BSgenome_1.18.1    DBI_0.2-5          RCurl_1.4-3        RSQLite_0.9-3      rtracklayer_1.10.5 XML_3.2-0         
>

On Nov 12, 2010, at 7:18 AM, Vincent Carey wrote:

> Here is a solution "entirely in R" provided you have acquired the
> hg18 transcript database using GenomicFeatures makeTranscriptDbFromUCSC("hg18")
> and run saveFeatures on the result to yield file "hg18.txdb.sqlite":
> 
> I put your data in space-delimited text, then transformed to GRanges preserving
> chromosome and strand, then find overlaps of your regions with transcripts
> for hg18.  The names of these transcripts are given in the UCSC 'knownGene'
> vocabulary.  After the session info, we use R 2.10 org.Hs.eg.db to resolve these
> to (I think) hg18 entrez ids (we only have hg19 mappings UCSCKG<->Entrez for
> current R, to the best of my knowledge, and I believe that even these
> are deprecated).
> 
>> kemdat = read.table("kemdat.txt", sep=" ", h=TRUE,stringsAsFactors=FALSE)
>> library(GenomicRanges)
>> kem = with(kemdat, GRanges(ranges=IRanges(Cluster_Begin, Cluster_End), strand=
> +  Rle(factor(Strand)), seqnames=Rle(factor(Chromosome))))
>> kem
> GRanges with 5 ranges and 0 elementMetadata values
>    seqnames                 ranges strand |
>       <Rle>              <IRanges>  <Rle> |
> [1]     chr2 [ 74295624,  74295644]      + |
> [2]     chr1 [203949843, 203949866]      - |
> [3]     chr8 [103464182, 103464207]      - |
> [4]    chr17 [ 53272097,  53272117]      - |
> [5]    chr12 [ 11150173,  11150194]      - |
> 
> seqlengths
>  chr1 chr12 chr17  chr2  chr8
>    NA    NA    NA    NA    NA
>> library(GenomicFeatures)
>> hg18.txdb = loadFeatures("hg18.txdb.sqlite")
>> tx18 = transcripts(hg18.txdb)
>> kg = values(tx18[ findOverlaps(kem,tx18)@matchMatrix[,2] ])$tx_name
>> dput(kg)
> c("uc002skj.1", "uc002skk.1", "uc010ffb.1", "uc001hdb.2", "uc003ykr.1",
> "uc003yks.1", "uc002ivc.1", "uc009zhp.1", "uc001qzb.2", "uc001qzc.2",
> "uc001qze.2", "uc001qzf.1", "uc001qzj.2")
>> sessionInfo()
> R version 2.13.0 Under development (unstable) (2010-10-29 r53474)
> Platform: x86_64-apple-darwin10.4.0/x86_64 (64-bit)
> 
> locale:
> [1] C
> 
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
> 
> other attached packages:
> [1] GenomicFeatures_1.3.6 GenomicRanges_1.3.2   IRanges_1.9.6
> 
> loaded via a namespace (and not attached):
> [1] BSgenome_1.17.7    Biobase_2.10.0     Biostrings_2.17.47 DBI_0.2-5
> [5] RCurl_1.4-3        RSQLite_0.9-2      XML_3.1-1          biomaRt_2.5.1
> [9] rtracklayer_1.11.3
>> 
> 
> Now use R 2.10.1 and org.Hs.eg.db 2.3.6 (current is 2.4.5 for devel,
> and the UCSCKG
> mapping is deprecated)
> 
>> kgs = c("uc002skj.1", "uc002skk.1", "uc010ffb.1", "uc001hdb.2", "uc003ykr.1",
> + "uc003yks.1", "uc002ivc.1", "uc009zhp.1", "uc001qzb.2", "uc001qzc.2",
> + "uc001qze.2", "uc001qzf.1", "uc001qzj.2")
>> unlist(mget(kgs, revmap(org.Hs.egUCSCKG))
> + )
> uc002skj.1 uc002skk.1 uc010ffb.1 uc001hdb.2 uc003ykr.1 uc003yks.1 uc002ivc.1
>   "10797"    "10797"    "10797"    "64710"    "51366"    "51366"    "51649"
> uc009zhp.1 uc001qzb.2 uc001qzc.2 uc001qze.2 uc001qzf.1 uc001qzj.2
>   "11272"     "5554"     "5554"     "5554"     "5545"     "5554"
>> sessionInfo()
> R version 2.10.1 RC (2009-12-10 r50697)
> i386-apple-darwin9.8.0
> 
> locale:
> [1] C
> 
> attached base packages:
> [1] stats     graphics  grDevices datasets  tools     utils     methods
> [8] base
> 
> other attached packages:
> [1] org.Hs.eg.db_2.3.6  RSQLite_0.7-3       DBI_0.2-5
> [4] AnnotationDbi_1.8.2 Biobase_2.6.1       weaver_1.11.1
> [7] codetools_0.2-2     digest_0.4.1
> 
> 
> 
> On Thu, Nov 11, 2010 at 10:01 PM, Kemal Akat <kakat at mail.rockefeller.edu> wrote:
>> Hi all,
>> 
>> I have a list of mapped sequence reads to hg18 for that I have the exact chromosomal location on NCBI build 36.
>> 
>> Cluster ID              Strand  Chromosome      Cluster_Begin   Cluster_End
>> slc754_chr2             +               chr2            74295624                74295644
>> slc4695_chr1            -               chr1                    203949843               203949866
>> slc2213_chr8            -               chr8            103464182               103464207
>> slc1866_chr17   -               chr17           53272097                53272117
>> slc1642_chr12   -               chr12           11150173                        11150194
>> ...
>> 
>> For the downstream analysis I would like to assign each location an identifier (entrez gene id, ensembl gene id and so forth), and the question is simply if I can use the biomaRt package for this at all?
>> 
>> It is easy for a single entry:
>> 
>>> geneid <-getBM(attributes="entrezgene", filters=c("chromosome_name","start","end", "strand"), values = list(2,74295624, 74295644,1), mart=ensembl54) # ensembl54 is using the archived build 54 = NCBI 36
>>> geneid
>>  entrezgene
>> 1      10797
>>> 
>> 
>> However, so far I have failed to make a batch query out of it.
>> 
>> I imported/created the following 1 column data frame with the localization formatted as necessary
>> 
>>> tdp
>>                                 chromosomal_cocation
>> slc754_chr2              2,74295624,74295644,1
>> slc4695_chr1    1,203949843,203949866,-1
>> slc2213_chr8    8,103464182,103464207,-1
>> slc1866_chr17           17,53272097,53272117,-1
>> slc1642_chr12           12,11150173,11150194,-1
>> ...
>> 
>> I have two points where I failed:
>> 
>> 1) I have not found a single filter that replaces the multiple filters above. When I use "chromosomal_region" as single filter and run:
>> 
>>> geneid <-getBM(attributes="entrezgene", filters="chromosomal_region", values = list(tdp$chromosomal_location, mart=ensembl54)
>> 
>> I get 19733 gene ids; my dataset actually has only 161 locations.
>> 
>> 2) If I use multiple filters like I did above in the first example, "values" has to be a vector and the expression "values = list(tdp$chromosomal_location, mart=ensembl54)" yields a "subscript out of bounds" error.
>> I tried splitting the localization infos into separate vectors, i.e. chromosome <- c(2,1,8,17,12,...), start <- c(74295624,....), end <- c(...), strand <- c(...) and modified my query:
>> 
>>> geneid <-getBM(attributes="entrezgene", filters=c("chromosome_name","start","end", "strand"), values = list(chromosome, start, end, strand, mart=ensembl54)
>> 
>> But this seems to combine the information in the different vectors as the result is over 20.000 entries.
>> 
>> Finally, I was thinking of a loop to complete the task, but this has been discouraged by another post in the mailing archive!?
>> 
>> Any help/idea appreciated!
>> 
>> Thank you,
>> Kemal
>> 
>> Dr. med. Kemal Akat
>> Postdoctoral Fellow
>> Laboratory of RNA Molecular Biology
>> The Rockefeller University
>> 1230 York Avenue, Box #186
>> New York, NY 10065
>> 
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>