[BioC] biomaRt: connection stopping
J.delasHeras at ed.ac.uk
Wed Sep 13 19:43:37 CEST 2006
Great suggestions!
Thanks!
Adding refseq as an attribute too, why didn't I think of that? :-)
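
For the archive, here is a minimal sketch of how the regrouping could go once A holds both columns. The data frame below is a made-up stand-in for what the single getBM() call returns (the IDs and symbols are invented for illustration), and symbols_by_refseq is just a name picked for this example. Only base R split() is used, so nothing here depends on the biomaRt connection.

```r
# Mock stand-in for the data frame returned by the single batched getBM() call.
# The IDs and symbols are invented for illustration only.
A <- data.frame(
  hgnc_symbol = c("GENE1", "GENE2", "GENE2B", "GENE3"),
  refseq_dna  = c("NM_000001", "NM_000002", "NM_000002", "NM_000003"),
  stringsAsFactors = FALSE
)

# The query vector, including one ID that returned no symbol.
RS <- c("NM_000001", "NM_000002", "NM_000003", "NM_000004")

# Group the symbols by RefSeq ID. Using factor(..., levels = RS) keeps the
# list in the same order as RS and leaves an empty character vector for
# IDs that had no hit.
symbols_by_refseq <- split(A$hgnc_symbol, factor(A$refseq_dna, levels = RS))
```

The result is a list with one element per entry in RS, which is the shape the element-by-element loop was trying to build, but from one query instead of 18000+.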
Jose
Quoting Steffen Durinck <durincks at mail.nih.gov>:
> Hi,
>
> I would like to add that biomaRt in RCurl mode can handle big queries
> but will break when you use it in a big loop.
> An alternative to what Jim suggests could be to do the query for all ids
> at once:
>
> A<-getBM(attributes=c("hgnc_symbol","refseq_dna"),mart=mart,filters="refseq_dna",values=RS)
>
> By adding refseq_dna as an attribute, HUGO symbols and RefSeq
> identifiers will be automatically matched up in A. If needed, you
> can loop over the result in A and you avoid doing 18000+ separate
> database queries so it will be faster.
>
> best,
> Steffen
>
>
>
>
> James W. MacDonald wrote:
>> J.delasHeras at ed.ac.uk wrote:
>>
>>> Hi,
>>>
>>> I suspect this is something to do purely with my connection, but I
>>> thought I'd ask, just in case:
>>>
>>> I have a list of refseq ids (NM_xxxxx), 18028 of them.
>>> I wanted to get the gene symbols for those genes, so I used biomaRt
>>> on the whole list. What I got was a single column data frame longer
>>> than 18028, as I get multiple results with some of these refseq
>>> ids. There doesn't seem to be an easy way to regroup them together,
>>> so I do the following instead:
>>>
>>
>> Using the RCurl interface for a big query like that isn't ideal. You
>> would be better off installing RMySQL and using the MySQL interface
>> (note: you can get RMySQL using biocLite(), thanks to the fine folks
>> in Seattle). Also, you can have getBM() put things in a list, so any
>> duplicated gene symbols will be grouped together.
>>
>> A <- getBM("hgnc_symbol", "refseq_dna", RS, mart = mart,
>>            output = "list", mysql = TRUE)
>>
>> Should do the trick.
>>
>> HTH,
>>
>> Jim
>>
>>
>>
>>> #create an empty list of the right length
>>> A<-vector(mode="list", length=18028)
>>> #now loop filling elements of the list from the biomaRt queries
>>> for (i in 1:18028){
>>> K<-i
>>> A[[i]]<-getBM(attributes=c("hgnc_symbol"),mart=mart,filters="refseq_dna",values=c(RS[i]))
>>> }
>>> print(K)
>>>
>>> RS is a vector containing the 18028 refseq ids.
>>> the K value is only so that I know where it breaks... because
>>> that's what happens... after a while, it breaks with an error
>>> message:
>>>
>>> Error in postForm(paste(mart at host, "?", sep = ""), query = xmlQuery) :
>>> couldn't connect to host
>>>
>>> This doesn't happen if I send the whole query in ONE go, in a
>>> vector... but if I do it element by element it breaks after 3-4000
>>> queries.
>>> Any ideas to do this in a simpler/better way? Or at least one that
>>> doesn't have me coming back to re-start the loop at the position of
>>> the last break?
>>>
>>> thanks!
>>>
>>> Jose
>>>
>>>
>>
>>
>>
>
>
> --
> Steffen Durinck, Ph.D.
>
> Oncogenomics Section
> Pediatric Oncology Branch
> National Cancer Institute, National Institutes of Health
> URL: http://home.ccr.cancer.gov/oncology/oncogenomics/
>
> Phone: 301-402-8103
> Address:
> Advanced Technology Center,
> 8717 Grovemont Circle
> Gaithersburg, MD 20877
>
--
Dr. Jose I. de las Heras Email: J.delasHeras at ed.ac.uk
The Wellcome Trust Centre for Cell Biology Phone: +44 (0)131 6513374
Institute for Cell & Molecular Biology Fax: +44 (0)131 6507360
Swann Building, Mayfield Road
University of Edinburgh
Edinburgh EH9 3JR
UK