[R] RE: R performance questions

cstrato cstrato at aon.at
Fri Dec 5 17:48:39 CET 2003


Dear Michael

You raise a very good question. The amount of microarray data is ever
increasing, and dealing with 10,000 .cel files is quite challenging.

R and Bioconductor are great for developing and testing novel
algorithms; however, personally, I do not think that R will ever
be able to deal with massive amounts of data. 10,000 .cel files from
the newest GeneChips amount to more than 200 gigabytes of data,
so we are eventually talking about data in the terabyte range.
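
Just as a rough back-of-the-envelope check (assuming something like
20 MB per .cel file for the newer chips, an assumption on my part):

    n_files     <- 10000
    mb_per_file <- 20                 # assumed size of one newer .cel file
    n_files * mb_per_file / 1024      # on the order of 200 gigabytes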

Maybe it is time to look at how scientists who are used to handling
large data sets deal with this problem, such as the high energy
physicists. Having done this, I have decided to start writing my own
expression analysis program, which is no longer based on R but on C++,
using a framework called ROOT, currently under development at CERN to
deal with petabytes (!!) of data; see:
http://www.ci.tuwien.ac.at/Conferences/DSC-2003/Proceedings/Stratowa.pdf

Sadly, it is taking me longer than expected to develop this software,
but you are looking ahead two or three years anyhow :-)

If microarray data were stored in the way described, i.e. in the
same way as high energy physics data, that would already be a step
in the right direction.
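
To illustrate the general idea only (this is not ROOT, just a tiny R
sketch of keeping every chip in one compact binary store instead of
thousands of separate text files; the file name and sizes are invented):

    store <- file("chips.bin", "ab")       # hypothetical consolidated store
    intensities <- runif(500000)           # stand-in for one chip's values
    writeBin(intensities, store, size = 4) # 4-byte floats keep it compact
    close(store)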

However, this is only my personal opinion. In our company I still
mainly use R to analyse our microarray data.

Best regards
Christian Stratowa
Vienna     Austria


Michael Benjamin wrote:
> Hi--
> 
> While I agree that we cannot agree on the ideal algorithms, we should be
> taking practical steps to implement microarrays in the clinic.  I think
> we can all agree that our algorithms have some degree of efficacy over
> and above conventional diagnostic techniques.  If patients are dying
> from lack of diagnostic accuracy, I think we have to work hard to use
> this technology to help them, if we can.  I think we can, even now.
> 
> What if I offer, in my clinic, a service for cancer patients to compare
> their affy data to an existing set of data, to predict their prognosis
> or response to chemotherapy?  I think people will line up out the door
> for such a service.  Knowing what we as a group of array analyzers know,
> wouldn't we all want this kind of service available if we or a loved one
> got cancer?
> 
> Can our programs deal with 1,000 .cel files?  10,000 files?  
> 
> I think our programs are pretty good, but what we need is DATA.  We must
> be careful what we wish for--we might get it!  So how do we measure
> whether analyzing 10,000 .cel files with library(affy) is feasible?  I'm
> assuming that advanced hardware would be required for such a task.  What
> are the critical components of such a platform?  How much money would a
> feasible system for array analysis cost?
> 
> I was just looking ahead two or three years--where is all this genomic
> array research headed?  I guess I'm concerned about scalability.  
> 
> Is anyone really working on implementing affy on a cluster/Beowulf?
> That sounds like a real challenge.
> 
> Regards,
> Michael Benjamin, MD
> -----Original Message-----
> From: Liaw, Andy [mailto:andy_liaw at merck.com] 
> Sent: Wednesday, December 03, 2003 9:47 PM
> To: 'Michael Benjamin'
> Subject: RE: [BioC] R performance questions
> 
> Another point about benchmarking:  As has been discussed on R-help
> before, benchmarks can be misleading, like the one you mentioned.  It
> measures linear algebra tasks, etc., but those typically account for a
> very small portion of "average" tasks.  Doug Bates also pointed out
> that the eigen() example used in that benchmark is computing mostly
> meaningless results.
> 
> In our experience, learning to use R more efficiently gives us the most
> mileage, but large and fast hardware wouldn't hurt...
> 
> Cheers,
> Andy
> 
> 
>>-----Original Message-----
>>From: Michael Benjamin [mailto:msb1129 at bellsouth.net] 
>>Sent: Wednesday, December 03, 2003 7:32 PM
>>To: 'Liaw, Andy'
>>Subject: RE: [BioC] R performance questions
>>
>>
>>Thanks.
>>Mike
>>
>>-----Original Message-----
>>From: Liaw, Andy [mailto:andy_liaw at merck.com] 
>>Sent: Wednesday, December 03, 2003 8:17 AM
>>To: 'Michael Benjamin'
>>Subject: RE: [BioC] R performance questions
>>
>>Hi Michael,
>>
>>Just one comment about SVM.  If you use the svm() function in the e1071
>>package to train a linear SVM, it will be rather slow.  That's a known
>>limitation of libsvm, which the svm() function uses.  If you are willing
>>to go outside of R, the "bsvm" package by C.J. Lin (the same person who
>>wrote libsvm) will train linear SVMs in a much more efficient manner.
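
For reference, a minimal sketch of the call in question, on toy data
only (the data here are random numbers, purely for illustration):

    library(e1071)
    x <- matrix(rnorm(150 * 500), nrow = 150)  # e.g. 150 samples x 500 genes
    y <- factor(rep(c("A", "B"), length.out = 150))
    fit  <- svm(x, y, kernel = "linear", cost = 1)  # uses libsvm internally
    pred <- predict(fit, x)
    table(pred, y)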
>>
>>HTH,
>>Andy
>>
>>
>>>-----Original Message-----
>>>From: bioconductor-bounces at stat.math.ethz.ch 
>>>[mailto:bioconductor-bounces at stat.math.ethz.ch] On Behalf Of 
>>>Michael Benjamin
>>>Sent: Tuesday, December 02, 2003 10:30 PM
>>>To: bioconductor at stat.math.ethz.ch
>>>Subject: [BioC] R performance questions
>>>
>>>
>>>Hi, all--
>>>
>>>I wanted to start a thread on R speed/benchmarking.  There is a nice R
>>>benchmarking overview at http://www.sciviews.org/other/benchmark.htm,
>>>along with a free script so you can see how your machine stacks up.
>>>Looks like R is substantially faster than S-plus.
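
For a quick, informal feel for one's own machine without running the
full script, a timing sketch along these lines works (the problem
sizes are arbitrary):

    set.seed(1)
    m <- matrix(rnorm(1000 * 1000), nrow = 1000)
    system.time(crossprod(m))        # linear algebra
    system.time(sort(rnorm(1e6)))    # sorting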
>>>
>>>My problem is this: with 512 MB and an overclocked AMD Athlon XP 1800+,
>>>running at 588 SPEC-FP 2000, it still takes FOREVER to analyze multiple
>>>.cel files using affy (expresso).  Running svm takes a mighty long time
>>>with more than 500 genes, 150 samples.
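
For reference, a minimal sketch of the kind of expresso call involved;
the method choices here are just common ones, not a recommendation, and
justRMA() is the lighter-weight alternative often suggested when there
are many files:

    library(affy)
    ab   <- ReadAffy()              # reads all *.CEL files in the directory
    eset <- expresso(ab,
                     bgcorrect.method = "rma",
                     normalize.method = "quantiles",
                     pmcorrect.method = "pmonly",
                     summary.method   = "medianpolish")
    ## justRMA() performs a similar pipeline in one step, using less memory:
    ## eset <- justRMA()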
>>>
>>>Questions:
>>>1) Would adding RAM or processing speed improve performance the most?
>>>2) Is it possible to run R on a cluster without rewriting my high-level
>>>code?  In other words,
>>>3) What are we going to do when we start collecting terabytes of array
>>>data to analyze?  There will come a "breaking point" at which desktop
>>>systems can't perform these analyses fast enough for large quantities
>>>of data.  What then?
>>>
>>>Michael Benjamin, MD
>>>Winship Cancer Institute
>>>Emory University,
>>>Atlanta, GA
>>>
>>>_______________________________________________
>>>Bioconductor mailing list
>>>Bioconductor at stat.math.ethz.ch
>>>https://www.stat.math.ethz.ch/mailman/listinfo/bioconductor
>>>
>>
>>
>>
>>
> 
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://www.stat.math.ethz.ch/mailman/listinfo/r-help
> 
>



