[BioC] BHC appears to be broken
Rich Savage
r.s.savage at warwick.ac.uk
Mon Apr 29 14:51:36 CEST 2013
Dear Joseph,
I believe the problem is that you're calling 'bhc' incorrectly. There is a
legacy value that needs to be entered as the 3rd input, which you're missing
(see below). Apologies for the presence of this - it's no longer required by
the algorithm, but is there to prevent people's scripts crashing that are set up
to use an early release of the package. The documentation examples for 'bhc' have it
set correctly, but it's not highlighted particularly.
(If we do a version 2.0, I'm sure we'll bite the bullet and remove this.)
Here is a modified version of your code that should demonstrate both how to make
it work, and also how to make it crash. (Tested on R 3.0.0, Snow Leopard build
for Mac OS X.)
library(BHC)
data <- read.csv("subsample.csv",header=FALSE)
itemLabels <- t(read.csv("labels.csv", header=FALSE)) #read in and transpose
timePoints <- 1:24 #number of timepoints
##THIS ONE WORKS
startTime <- Sys.time()
BHC_OUT <- bhc(data, itemLabels, 0, timePoints,"time-course",verbose=TRUE)
plot(BHC_OUT, axes=FALSE)
print(Sys.time() - startTime)
##THIS ONE CRASHES 'R'
BHC_OUT <- bhc(data, itemLabels, timePoints,"time-course",verbose=TRUE)
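A rough pre-flight check along the following lines might catch a missing 3rd input before 'bhc' is ever called. This is only a sketch: 'check_bhc_inputs' is a hypothetical helper and not part of the BHC package, the toy data merely stands in for subsample.csv, and the two length checks assume (as in the code above) that rows of 'data' are items and columns are time points.

```r
# Hypothetical helper, NOT part of BHC: returns a character vector of
# problems found in the first four positional inputs to bhc().
check_bhc_inputs <- function(data, itemLabels, legacyValue, timePoints) {
  problems <- character(0)
  # The legacy 3rd input should be a single number (0 in the working call).
  if (!is.numeric(legacyValue) || length(legacyValue) != 1) {
    problems <- c(problems, "3rd input must be a single number, e.g. 0")
  }
  # Assumption: one time point per column of 'data'.
  if (length(timePoints) != ncol(data)) {
    problems <- c(problems, "length(timePoints) must equal ncol(data)")
  }
  # Assumption: one label per row (item) of 'data'.
  if (length(itemLabels) != nrow(data)) {
    problems <- c(problems, "length(itemLabels) must equal nrow(data)")
  }
  problems
}

# Toy stand-in for the real data: 5 items, 24 time points.
set.seed(1)
toyData   <- as.data.frame(matrix(rnorm(5 * 24), nrow = 5))
toyLabels <- paste0("item", 1:5)
toyTimes  <- 1:24

# Same positional layout as the working call.
ok <- check_bhc_inputs(toyData, toyLabels, 0, toyTimes)

# Same positional layout as the crashing call: with the legacy value
# omitted, timePoints lands in slot 3 and "time-course" in slot 4.
bad <- check_bhc_inputs(toyData, toyLabels, toyTimes, "time-course")

print(ok)
print(bad)
```

An empty result only means the inputs have the same shape as in the working call above. It does not guarantee that 'bhc' itself will run to completion.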
BTW, I noticed the following code comment in your original email to the list:
# this equals 152000, approximately
Does this imply you might be trying the randomised algorithm for timeseries BHC
with over 10^5 items? If so, I would be interested to hear how you get on -
we've never tried it for such a large data set.
(the feedback would also be very useful, as I have in mind some ideas for a
next-generation clustering tool, so any insights you gain from running a large
data set would be very informative. Thanks!)
Please let me know if you have any further questions.
Best regards,
Rich
--
------------------------------------------------------------------
Dr. Richard Savage Tel: +44 (0)24 765 72507
Systems Biology Centre
University of Warwick
Coventry
CV4 7AL
United Kingdom
http://sites.google.com/site/drrichsavage/
http://21stcenturyscientist.blogspot.com/
------------------------------------------------------------------
On 27/04/2013 18:41, Dan Tenenbaum wrote:
> On Fri, Apr 26, 2013 at 3:00 PM, Joseph Viviano <vivianoj at yorku.ca> wrote:
>> Hello, my apologies for the sloppy post.
>>
>> You can find a sample dataset here: https://www.dropbox.com/sh/p1od9e4vx8ky66a/igt2OkNDbQ
>>
>> And the code I ran was essentially:
>>
>> data <- read.csv("subsample.csv",header=FALSE)
>> itemLabels <- t(read.csv("labels.csv", header=FALSE)) #read in and transpose
>> timePoints <- 1:24 #number of timepoints
>> BHC_OUT <- bhc(data,itemLabels,timePoints,"time-course",verbose=TRUE,numThreads=8)
>>
>> This is where it completely locks up. Also, note that I get the same result with multiple permutations of the bhc command, and that this occurs on multiple versions of R for me (including the latest releases).
>>
>
> Thanks. It does appear to use increasing amounts of CPU and memory.
> I'm cc'ing the BHC maintainer.
> Dan
>
>
>> I should note that I have demeaned and variance normalized all time series before entering them into bhc, if that makes a difference.
>>
>> Cheers, Joseph
>>
>> On Wed, Apr 24, 2013 at 3:56 PM, Joseph Viviano<vivianoj at yorku.ca> wrote:
>>
>>> Hello all,
>>>
>>> I am having a great deal of trouble getting BHC to run on non-trivial
>>> datasets. I am using the following commands:
>>>
>>> data <- read.csv("data.csv")
>>
>> Can you share this dataset, or at least enough of it to reproduce the problem?
>>
>>> itemLabels <- names(data)
>>> timePoints <- 1:24 # for the time-course case
>>>
>>> nDataItems <- nrow(data) # this equals 152000, approximately
>>> nFeatures <- ncol(data) # this equals 24
>>>
>>> BHC_OUT <- bhc(data,itemLabels,timePoints"time-course",verbose=TRUE)
>>
>> This line produces a syntax error.
>>
>> In order to help you we need a fully reproducible example. Also,
>> please send the output of the sessionInfo() command.
>>
>> Dan
>>
>>
>>> ---
>>>
>>> This causes R to immediately lock up on windows 7, linux mint 13, and
>>> OSX 10.6.8. The input data are variance normalized time-series exported
>>> from MATLAB.
>>>
>>> Here is a sample timeseries from the .csv:
>>>
>>> -1.7858,-0.26742,0.37038,-0.87986,-0.55435,-0.89642,-1.2815,-0.62659,-0.98028,-1.0542,-1.0058,0.51103,0.90252,2.5272,-0.3048,0.81275,0.22414,0.15235,-0.20437,0.2545,0.95103,1.4214,0.82618,0.77179
>>>
>>> Any help would be greatly appreciated.
>>>
>>> Cheers, Joseph
>>>
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>