[BioC] BHC appears to be broken

Mon Apr 29 14:51:36 CEST 2013

Dear Joseph,

I believe the problem is that you're calling 'bhc' incorrectly.  There is a 
legacy value that must be supplied as the 3rd argument, which your call omits 
(see below).  Apologies for its presence: the algorithm no longer uses it, but 
it remains so that scripts written for an early release of the package don't 
break.  The documentation examples for 'bhc' set it correctly, but don't 
particularly highlight it.

(if we do a version 2.0, I'm sure we'll bite the bullet and remove this)

Here is a modified version of your code that should demonstrate both how to make 
it work and how to make it crash.  (Tested on R 3.0.0, Snow Leopard build 
for Mac OS X.)

library(BHC)
data       <- read.csv("subsample.csv",header=FALSE)
itemLabels <- t(read.csv("labels.csv", header=FALSE)) #read in and transpose
timePoints <- 1:24                                    #number of timepoints

##THIS ONE WORKS
startTime <- Sys.time()
BHC_OUT   <- bhc(data, itemLabels, 0, timePoints,"time-course",verbose=TRUE)
plot(BHC_OUT, axes=FALSE)
print(Sys.time() - startTime)

##THIS ONE CRASHES 'R'
BHC_OUT <- bhc(data, itemLabels, timePoints,"time-course",verbose=TRUE)
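
If it helps, here is a minimal sketch of a wrapper that always supplies the 
legacy 3rd argument, so it can't be left out by accident.  'runTimeCourseBHC' 
and its 'bhcFun' parameter are hypothetical names, not part of the BHC package, 
and the two dimension checks assume one row per item and one column per time 
point, as in your data.

```r
# Hypothetical helper, NOT part of the BHC package.
# 'bhcFun' defaults to BHC::bhc; it is a parameter only so that the wrapper
# can be tried with a stand-in function when the package is not installed.
runTimeCourseBHC <- function(data, itemLabels, timePoints, ...,
                             bhcFun = BHC::bhc) {
  # Assumption: rows are items, columns are time points.
  stopifnot(ncol(data) == length(timePoints))
  stopifnot(nrow(data) == length(itemLabels))
  # The literal 0 is the legacy 3rd argument; it is always passed here.
  bhcFun(data, itemLabels, 0, timePoints, "time-course", ...)
}
```

With this in place, runTimeCourseBHC(data, itemLabels, timePoints, verbose=TRUE) 
should be equivalent to the working call above.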

BTW, I noticed the following code comment in your original email to the list:

# this equals 152000, approximately

Does this imply you might be trying the randomised algorithm for timeseries BHC 
with over 10^5 items?  If so, I would be interested to hear how you get on - 
we've never tried it for such a large data set.

(the feedback would also be very useful, as I have in mind some ideas for a 
next-generation clustering tool, so any insights you gain from running a large 
data set would be very informative.  Thanks!)
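
For gauging how the run time scales before committing to the full data set, a 
rough sketch: time the call on growing subsets of rows.  'timeSubsets' and 
'clusterFun' are hypothetical names; 'clusterFun' is assumed to be any function 
of (data, labels), for instance a closure around the working 'bhc' call above.

```r
# Hypothetical helper: elapsed seconds for a clustering call on the first
# n rows of 'data', for each n in 'sizes'.
timeSubsets <- function(data, itemLabels, sizes, clusterFun) {
  sapply(sizes, function(n) {
    idx <- seq_len(n)
    # drop = FALSE keeps a one-row subset as a matrix / data frame
    timing <- system.time(clusterFun(data[idx, , drop = FALSE], itemLabels[idx]))
    timing[["elapsed"]]
  })
}

# Example use (assumes 'data', 'itemLabels' and 'timePoints' as defined above):
# timeSubsets(data, itemLabels, c(100, 200, 400),
#             function(d, l) bhc(d, l, 0, timePoints, "time-course"))
```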

Please let me know if you have any further questions.

Best regards,

Rich

-- 
------------------------------------------------------------------
   Dr. Richard Savage			Tel: +44 (0)24 765 72507
   Systems Biology Centre		
   University of Warwick
   Coventry
   CV4 7AL
   United Kingdom

   http://sites.google.com/site/drrichsavage/
   http://21stcenturyscientist.blogspot.com/
------------------------------------------------------------------

On 27/04/2013 18:41, Dan Tenenbaum wrote:
> On Fri, Apr 26, 2013 at 3:00 PM, Joseph Viviano <vivianoj at yorku.ca> wrote:
>> Hello, my apologies for the sloppy post.
>>
>> You can find a sample dataset here: https://www.dropbox.com/sh/p1od9e4vx8ky66a/igt2OkNDbQ
>>
>> And the code I ran was essentially:
>>
>> data       <- read.csv("subsample.csv",header=FALSE)
>> itemLabels <- t(read.csv("labels.csv", header=FALSE)) #read in and transpose
>> timePoints <- 1:24                                    #number of timepoints
>> BHC_OUT    <- bhc(data,itemLabels,timePoints,"time-course",verbose=TRUE,numThreads=8)
>>
>> This is where it completely locks up. Also, note that I get the same result with multiple permutations of the bhc command, and that this occurs on multiple versions of R for me (including the latest releases).
>>
>
> Thanks. It does appear to use increasing amounts of CPU and memory.
> I'm cc'ing the BHC maintainer.
> Dan
>
>
>> I should note that I have demeaned and variance normalized all time series before entering them into bhc, if that makes a difference.
>>
>> Cheers, Joseph
>>
>> On Wed, Apr 24, 2013 at 3:56 PM, Joseph Viviano<vivianoj at yorku.ca>  wrote:
>>
>>> Hello all,
>>>
>>> I am having a great deal of trouble getting BHC to run on non-trivial
>>> datasets. I am using the following commands:
>>>
>>> data           <- read.csv("data.csv")
>>
>> Can you share this dataset, or at least enough of it to reproduce the problem?
>>
>>> itemLabels <- names(data)
>>> timePoints <- 1:24 # for the time-course case
>>>
>>> nDataItems <- nrow(data) # this equals 152000, approximately
>>> nFeatures  <- ncol(data)   # this equals 24
>>>
>>> BHC_OUT <- bhc(data,itemLabels,timePoints"time-course",verbose=TRUE)
>>
>> This line produces a syntax error.
>>
>> In order to help you we need a fully reproducible example. Also,
>> please send the output of the sessionInfo() command.
>>
>> Dan
>>
>>
>>> ---
>>>
>>> This causes R to immediately lock up on windows 7, linux mint 13, and
>>> OSX 10.6.8. The input data are variance normalized time-series exported
>>> from MATLAB.
>>>
>>> Here is a sample timeseries from the .csv:
>>>
>>> -1.7858,-0.26742,0.37038,-0.87986,-0.55435,-0.89642,-1.2815,-0.62659,-0.98028,-1.0542,-1.0058,0.51103,0.90252,2.5272,-0.3048,0.81275,0.22414,0.15235,-0.20437,0.2545,0.95103,1.4214,0.82618,0.77179
>>>
>>> Any help would be greatly appreciated.
>>>
>>> Cheers, Joseph
>>>
>>>
>>>           [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
>