[BioC] biomaRt: connection stopping
Steffen Durinck
durincks at mail.nih.gov
Wed Sep 13 18:19:15 CEST 2006
Hi,
I would like to add that biomaRt in RCurl mode can handle big queries,
but it will break when you call it repeatedly in a big loop.
An alternative to what Jim suggests could be to do the query for all ids
at once:
A <- getBM(attributes = c("hgnc_symbol", "refseq_dna"),
           filters = "refseq_dna", values = RS, mart = mart)
Because refseq_dna is also requested as an attribute, each HUGO symbol
is returned alongside its RefSeq identifier in A. If needed, you can
then loop over the rows of A. This avoids 18000+ separate database
queries, so it will be faster.
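
If you want one list element per RefSeq id (as in the original loop), the
two-column result can be regrouped without any further queries. Below is a
minimal sketch using base R's split(). The data frame here is a made-up
stand-in for what getBM() returns, and the ids and symbols in it are
hypothetical, so treat it as an illustration rather than real output:

```r
# Mock stand-in for the data frame returned by getBM(); all values are made up.
A <- data.frame(
  hgnc_symbol = c("GENE1", "GENE2", "GENE2B", "GENE3"),
  refseq_dna  = c("NM_000001", "NM_000002", "NM_000002", "NM_000003"),
  stringsAsFactors = FALSE
)

# Hypothetical query vector; the last id has no match in A.
RS <- c("NM_000003", "NM_000001", "NM_000002", "NM_999999")

# Group the symbols by RefSeq id: one list element per id found in A.
symbols_by_refseq <- split(A$hgnc_symbol, A$refseq_dna)

# Reorder to follow RS; ids without a hit end up as NULL elements.
result <- symbols_by_refseq[RS]
names(result) <- RS
```

Here result[["NM_000002"]] holds both symbols for that id, and ids that
returned nothing are easy to spot with sapply(result, is.null).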
best,
Steffen
James W. MacDonald wrote:
> J.delasHeras at ed.ac.uk wrote:
>
>> Hi,
>>
>> I suspect this is something to do purely with my connection, but I
>> thought I'd ask, just in case:
>>
>> I have a list of refseq ids (NM_xxxxx), 18028 of them.
>> I wanted to get the gene symbols for those genes, so I used biomaRt on
>> the whole list. What I got was a single column data frame longer than
>> 18028, as I get multiple results with some of these refseq ids. There
>> doesn't seem to be an easy way to regroup them together, so I do the
>> following instead:
>>
>
> Using the RCurl interface for a big query like that isn't ideal. You
> would be better off installing RMySQL and using the MySQL interface
> (note: you can get RMySQL using biocLite(), thanks to the fine folks in
> Seattle). Also, you can have getBM() put things in a list, so any
> duplicated gene symbols will be grouped together.
>
> A <- getBM("hgnc_symbol", "refseq_dna", RS, mart = mart, output =
> "list", mysql = TRUE)
>
> Should do the trick.
>
> HTH,
>
> Jim
>
>
>
>> #create an empty list of the right length
>> A<-vector(mode="list", length=18028)
>> #now loop filling elements of the list from the biomaRt queries
>> for (i in 1:18028){
>> K<-i
>> A[[i]]<-getBM(attributes=c("hgnc_symbol"),mart=mart,filters="refseq_dna",values=c(RS[i]))
>> }
>> print(K)
>>
>> RS is a vector containing the 18028 refseq ids.
>> the K value is only so that I know where it breaks... because that's
>> what happens... after a while, it breaks with an error message:
>>
>> Error in postForm(paste(mart@host, "?", sep = ""), query = xmlQuery) :
>> couldn't connect to host
>>
>> This doesn't happen if I send the whole query in ONE go, in a vector...
>> but if I do it element by element it breaks after 3-4000 queries.
>> Any ideas to do this in a simpler/better way? Or at least one that
>> doesn't have me coming back to re-start the loop at the position of the
>> last break?
>>
>> thanks!
>>
>> Jose
>>
>>
>
>
>
--
Steffen Durinck, Ph.D.
Oncogenomics Section
Pediatric Oncology Branch
National Cancer Institute, National Institutes of Health
URL: http://home.ccr.cancer.gov/oncology/oncogenomics/
Phone: 301-402-8103
Address:
Advanced Technology Center,
8717 Grovemont Circle
Gaithersburg, MD 20877