[BioC] biomaRt queries: optimal size?
Wolfgang Huber
whuber at embl.de
Tue Dec 22 14:04:59 CET 2009
Hola José
Sorry for the name confusion. The way that BioMart presents many-to-one
relationships (producing a single big table with all queried
attributes, and possibly lots of repetitions in some columns) can be
very space-inefficient. This is the price that the system's design pays
for its simplicity.
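To illustrate the row multiplication, here is a toy sketch in base R. It
does not query BioMart at all: the gene, GO and transcript identifiers are
invented, and merge() only stands in for the way a single flat table pairs
every value of one multi-valued attribute with every value of another.

```r
# Toy data only: all identifiers below are invented for illustration.
go <- data.frame(gene  = c(rep("g1", 4), "g2"),
                 go_id = c("GO:a", "GO:b", "GO:c", "GO:d", "GO:a"),
                 stringsAsFactors = FALSE)
tx <- data.frame(gene       = c(rep("g1", 3), "g2"),
                 transcript = c("t1", "t2", "t3", "t4"),
                 stringsAsFactors = FALSE)

# One flat table with both attributes: each GO term of a gene is paired
# with each transcript of that gene, so row counts multiply per gene.
flat <- merge(go, tx, by = "gene")

nrow(flat)            # 4 * 3 rows for g1 plus 1 * 1 row for g2 = 13
nrow(go) + nrow(tx)   # the two attributes kept separately: 5 + 4 = 9
table(flat$gene)      # rows per gene in the flat table
```

With only two multi-valued attributes the difference is small, but each
additional multi-valued attribute multiplies the row count again.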
Anyway, I don't think it should return table rows that are completely
identical. If you (or someone else here) come across such an
instance, then please report it on this list!
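A quick way to check a result for completely identical rows is base R's
duplicated(). In this sketch 'res' is a hypothetical stand-in for a data
frame returned by a query, filled with invented values:

```r
# 'res' is a made-up example, not real query output.
res <- data.frame(gene  = c("g1", "g1", "g1", "g2"),
                  go_id = c("GO:a", "GO:b", "GO:a", "GO:a"),
                  stringsAsFactors = FALSE)

# TRUE for every row that is identical, in all columns, to an earlier row
dup <- duplicated(res)

sum(dup)                    # number of fully duplicated rows
res[dup, , drop = FALSE]    # the offending rows, useful for a report

# Drop the repeats before further processing
res_unique <- unique(res)
```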
Best wishes
Wolfgang
PS Do you know the way to San ... :)
J.delasHeras at ed.ac.uk scripsit 12/21/2009 07:03 PM:
> Quoting Wolfgang Huber <whuber at embl.de>:
>
>>
>> Dear Javier
>>
>> Try these:
>>
>> 1. Set
>> options(error=recover)
>> and then use the 'post mortem' debugger to see why postRes (a character
>> string) is so large. Let us know what you find!
>>
>> 2. Rather than splitting up the query genes, you could split up the
>> attributes, and only ask for a few at a time, and/or see which one
>> causes the large size of the result
>>
>> 3. Send us a reproducible example (i.e. one that others can reproduce
>> by copy-pasting from your email).
>>
>> Best wishes
>> Wolfgang
>
>
> "My name is not Javier!!!"
>
> (you had to be in Spain in the 80s to get the joke... never mind, it was
> a silly pop song ;-)
>
> Thank you for the suggestions. I managed to finish what I was doing
> (breaking the query into chunks of 200 ids at a time) but I have some
> more searches coming and will definitely use a different approach, and
> try the options(error=recover) method to investigate if I have problems.
>
> My query, as you suggest above, would be better performed by using fewer
> attributes, rather than splitting the ids. I just didn't have enough
> experience in this. When using multiple attributes, the resulting data
> frame may contain quite a few more rows of data, if there are multiple
> values for some of the attributes... and this happens a lot when looking
> at gene ontologies.
> I may have started with a 1545 id vector, but ended up with a data frame
> containing nearly 4 million rows! (assembled from 8 individual queries
> of ~200 ids at a time) I will definitely not do it again this way!
> Much better to pick fewer attributes and then process the data, and then
> I'll probably be able to process all IDs at once.
>
> Thank you for your help, Wolfgang and Jim.
>
> Jose
>
--
Wolfgang Huber
EMBL
http://www.embl.de/research/units/genome_biology/huber/contact