[BioC] How to do clustering

Martin Morgan mtmorgan at fhcrc.org
Wed Jun 20 06:10:06 CEST 2007


Alex,

> library(Biobase)
[snip]
> args(rowQ)
function (imat, which) 
NULL
> showMethods("rowQ")
Function: rowQ (package Biobase)
imat="ExpressionSet", which="numeric"
imat="exprSet", which="numeric"
imat="matrix", which="numeric"

so it looks like x should be a matrix rather than a data frame. 
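
A minimal sketch of that fix, assuming x is an all-numeric data frame like the one in the quoted message. The toy data below is invented for illustration; the real x is not shown in the thread. The rowQ calls are left as comments so the example runs without Biobase. The base-R route uses apply() with IQR(), which interpolates quantiles and so may give slightly different values from the order statistics that rowQ returns.

```r
# Toy stand-in for the poster's data: 8 probes x 49 samples, stored as a
# data frame (hypothetical values, only here to make the sketch runnable).
set.seed(1)
x <- as.data.frame(matrix(rnorm(8 * 49), nrow = 8,
                          dimnames = list(paste0("probe", 1:8),
                                          paste0("s", 1:49))))

# rowQ has no data.frame method, so coerce to a numeric matrix first.
m <- as.matrix(x)
stopifnot(is.matrix(m), is.numeric(m))

# With Biobase attached, the original calls should then dispatch:
#   lowQ <- rowQ(m, floor(0.25 * ncol(m)))
#   upQ  <- rowQ(m, ceiling(0.75 * ncol(m)))
#   iqrs <- upQ - lowQ

# Base-R alternative: per-row interquartile range via stats::IQR.
iqrs <- apply(m, 1, IQR)

# Keep rows whose IQR exceeds the 75th percentile of all row IQRs.
giqr <- iqrs > quantile(iqrs, probs = 0.75)
xsub <- m[giqr, , drop = FALSE]
dim(xsub)
```

Using ncol(m) instead of a hard-coded 49 keeps the cut-offs correct if the number of samples changes, and drop = FALSE keeps xsub a matrix even if only one row passes the filter.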

Martin

"ssls sddd" <ssls.sddd at gmail.com> writes:

> Hi Thomas,
>
> Thanks! Sorry for getting back to it late because I was out
> of town for a couple of days.
>
> I like the idea of 'removing all rows with low variability across
> samples'. I searched around and found an online tutorial,
> http://www.economia.unimi.it/projects/marray/2006/material/Lab3/MachineLearning/ML-lab.pdf
> which does something very similar: it teaches how to filter out
> genes that are not differentially expressed.
>
> It takes the simplistic approach of using the 75th percentile of the
> interquartile range (IQR) as the cut-off point and computes quantiles
> using rowQ.
>
> I followed their method and my code is:
>
> library("Biobase")
> lowQ = rowQ(x, floor(0.25 * 49))#49 for 49 samples
> upQ = rowQ(x, ceiling(0.75 * 49))
> iqrs = upQ - lowQ
> giqr = iqrs > quantile(iqrs, probs = 0.75)
> sum(giqr)
> xsub = x[giqr, ]
> dim(xsub)
>
> But the error message is like:
>
> function (classes, fdef, mtable)  :
>         unable to find an inherited method for function "rowQ", for
> signature "data.frame", "numeric"
>
> Perhaps you have some experience with 'rowQ'? If I want to use the IQR
> function, how should I approach this?
>
> I really appreciate your help!
>
> Thank you very much!
>
> Sincerely,
>
> Alex
>
>
>
> On 6/13/07, Thomas Girke <thomas.girke at ucr.edu> wrote:
>>
>> Dear Alex,
>>
>> In addition to Sean's advice, I would like to point out that the
>> sample you give below indicates that you are trying to pass both a
>> column dendrogram and a row dendrogram to the heatmap function. With
>> your matrix of 238,000 rows by 49 columns you should have only a
>> column dendrogram, because the row dendrogram would take more than
>> 200 GB of memory to calculate. You can still use the heatmap or
>> heatmap.2 functions by turning off the row sorting, setting the Rowv
>> argument to NA. In addition to this, I would consider filtering your
>> rows in a meaningful manner to a much smaller number, perhaps by using
>> R's IQR function to remove all rows with very low variability. I am
>> suggesting this because you won't see any patterns in the heatmap
>> when you have so many rows. If the row filtering works, then you
>> could generate a dendrogram for the row dimension as well.
>> Remember: hclust will require ~4 GB of memory to cluster ~30,000 items
>> and < 1 GB for 10,000 items, and pvclust, which uses hclust internally,
>> will need even more than this.
>>
>> As more general advice, when working with large data sets in R, always
>> subset your data to something very small to test out your strategy
>> first, because this will save you a lot of time.
>> In your case, this could be done by selecting just the first 100 rows
>> of your matrix like this:
>>                 my_matrix <- my_matrix[1:100, ]
>>
>> Once you have tested things out, just remove the '[1:100,]' part from
>> your script/protocol.
>>
>> Best,
>>
>> Thomas
>>
>>
>> On Wed 06/13/07 06:02, Sean Davis wrote:
>> > ssls sddd wrote:
>> > > Dear Dr.Thomas Girke,
>> > >
>> > > I have one more question for you. I tried pvclust in the section
>> > > 'Obtain significant clusters by pvclust bootstrap analysis' for my
>> > > data, x.
>> > >
>> > > But I have a problem with:
>> > >
>> > > heatmap(x, Rowv=dend_colored, Colv=as.dendrogram(hc), col=my.colorFct(),
>> > > scale="row", RowSideColors=mycolhc)
>> > >
>> > > the error was:
>> > >
>> > > error in heatmap(x, Rowv = dend_colored, Colv = as.dendrogram(hc), col =
>> > > my.colorFct(),  :
>> > >         'x' must be a numeric matrix
>> > >
>> > > I ran 'x[1:3,1:3]' and it produced the following:
>> > >
>> > >               AIRNS_A09 AIRNS_A11 AIRNS_A12
>> > > SNP_A-1780271   1.85642   1.50956   1.73154
>> > > SNP_A-1780274   1.72140   1.83712   1.85948
>> > > SNP_A-1780277   2.04241   1.53458   1.65270
>> > >
>> > > I think x is a numeric matrix. Where do you think I may have gone wrong?
>> >
>> > Try coercing the x into a matrix directly:
>> >
>> > heatmap(as.matrix(x), Rowv=dend_colored, Colv=as.dendrogram(hc),
>> > col=my.colorFct(), scale="row", RowSideColors=mycolhc)
>> >
>> > Does this fix the problem?  You can always check the class of an object
>> > by doing something like:
>> >
>> > class(x)
>> >
>> > which should report:
>> >
>> > [1] "matrix"
>> >
>> > Hope that helps.
>> >
>> > Sean
>> >
>>
>> --
>> Dr. Thomas Girke
>> Assistant Professor of Bioinformatics
>> Director, IIGB Bioinformatic Facility
>> Center for Plant Cell Biology (CEPCEB)
>> Institute for Integrative Genome Biology (IIGB)
>> Department of Botany and Plant Sciences
>> 1008 Noel T. Keen Hall
>> University of California
>> Riverside, CA 92521
>>
>> E-mail: thomas.girke at ucr.edu
>> Website: http://faculty.ucr.edu/~tgirke
>> Ph: 951-827-2469
>> Fax: 951-827-4437
>>
>

-- 
Martin Morgan
Bioconductor / Computational Biology
http://bioconductor.org