[BioC] problem with data processing in R

Thomas Girke thomas.girke at ucr.edu
Sun Dec 13 23:33:00 CET 2009


Maxim,

Certainly, the sorted heatmap example is only useful for visualization
purposes, while the clustering example is included to show how to perform
clustering on your data in R in general not to provide a final solution
for your research problem. 

If your intention is to cluster the final clustering results of your
factors then you could compute a similarity/distance metric among their
clustering results using a set similarity measure such as the Jaccard
partitioning index. This way you can cluster entire clustering data
sets. For hierarchical clustering results you would first need to
generate discrete clusters from the dendrogams. R's cutree function is
very useful for this. Some examples and links to related R libraries for 
clustering partitioning results are available here:
http://faculty.ucr.edu/~tgirke/Documents/R_BioCond/R_BioCondManual.html#clustering_jaccard

Similarly, you could compute membership similarities among the clusters
obtained for all your factors using again the Jaccard partitioning index
or the variation of information criterion. This could be used to cluster
all the groupings obtained from your clustering results ("clustering of clusters"). 

Since I do not know enough about your data sets or the genomics technology you are
using this case, I am not sure if any of this makes much sense. But perhaps
it is useful for exploring your options...

Thomas 








On Sun, Dec 13, 2009 at 10:28:11PM +0100, Maxim wrote:
> This works nicely.
> 
> It means I cluster the data for one factor and sort the other factors
> according to the clustering result. But what next?
> 
> What I will have to do is to find (cluster together) patterns that are
> similar *between* different factors and not within the data of one factor
> only. As I mentioned the attached data was already clustered in such a
> manner and obviously the patterns of factor 1-3 are somewhat similar.
> Therefore clustering makes not much sense with this data.
> 
> I'm not sure whether I might just miss some aspect you try to explain but as
> far as I got it, there is some essential thing still missing in order to
> accomplish the clustering analysis. When you look at the attached heatmap
> you'll find clusters, where factor 1 and 2 are showing similar patterns,
> others for 2 and 3, again others for 1 and 2 and 3 and so forth. To sort out
> this differences some additional trick has to be done.
> 
> Despite of this my R programming skills get better and better and I think I
> will soon substitute lots of my Perl-Code with R-Code, especially that for
> the data container manipulation tools.
> 
> Maxim
> 
> 
> 
> 
> 
> 
> 2009/12/13 Thomas Girke <thomas.girke at ucr.edu>
> 
> > Great. - The remaining parts of your analysis seem to be solvable
> > with simple subsetting and sorting routines of data objects in R.
> >
> > Here are some more suggestions that might help you here:
> >
> > ## Accessing data components of hclust objects:
> > names(hr)
> > hr$labels; hr$order
> >
> > ## How to return object labels in the order of a hierarchical clustering
> > result:
> > hr$labels[hr$order]
> >
> > ## Sorting vector to join rows of same type in a data matrix (e.g. your
> > factorX data sets)
> > sortv <- as.vector(matrix(1:500, nrow=5, ncol=100, byrow=T))
> > somematrix[sortv, ]
> >
> > Thomas
> >
> >
> > On Sun, Dec 13, 2009 at 11:38:54AM +0100, Maxim wrote:
> > > Thank you,
> > >
> > > that works great. I was not aware of the "image" function. To do scaling
> > > helped definitely o improve the quality of the heatmaps. In addition I
> > > learned a lot about handling/importing data.
> > >
> > > I have to think about the clustering itself. Your suggestions works fine
> > to
> > > cluster 1 dataset. Unfortunately clustering has to be performed comparing
> > > all 5 datasets with each other. That means a given row in dataset 1 has
> > to
> > > be aligned with the same rows in datasets 2-5. When I perform clustering
> > of
> > > the individual datasets each set will be clustered differently.
> > >
> > > Despite of this you definitely provided a good starting point to better
> > > understand how to exploit R for my project.
> > >
> > > Maxim
> > >
> > > 2009/12/13 Thomas Girke <thomas.girke at ucr.edu>
> > >
> > > > Below are some suggestions for importing your data into R, plotting
> > > > heatmaps and clustering them using hierarchical clustering. For the
> > > > clustering, you have to choose a proper distance measure for your data
> > > > type and of course an efficient partitioning algorithm. The task view
> > > > site I sent previously provides a good overview as to what is
> > available.
> > > > Clustering of 7000 objects in R is not a problem for most clustering
> > > > methods. Many of them, including hclust, are implemented in C++/Fortran
> > > > to run decently time and memory efficient.
> > > >
> > > > I hope this will help to get you started.
> > > >
> > > > Thomas
> > > >
> > > >
> > > > ## Import of data sets (appended in one data frame)
> > > > ## Note: a row of NA values is inserted to visually separate data sets
> > in
> > > > heatmap
> > > > filenames <- list.files(pattern="factor*") # Requires data files
> > "factorX"
> > > > in working dir
> > > > impDF <- data.frame(NULL)
> > > > for(i in filenames) {
> > > >        tmp <- read.delim(i, header=FALSE)
> > > >        impDF <- rbind(impDF, tmp, rep(NA, dim(tmp)[2]))
> > > > }
> > > > myrownames <- paste(1:length(impDF[,1]), impDF[,2], impDF[,3], sep="_")
> > > > rownames(impDF) <- myrownames
> > > > impDF <- impDF[,4:44]
> > > > imp <- as.matrix(impDF)
> > > >
> > > > ## Alternative import into list container (not used in following
> > example)
> > > > filenames <- list.files(pattern="factor*")
> > > > datalist <- lapply(filenames, function(x) read.delim(x, header=F))
> > > > names(datalist) <- filenames
> > > >
> > > > ## Plot heatmap with heatmap.2
> > > > library(gplots)
> > > > heatmap.2(imp, dendrogram="none", Rowv=F, Colv=F, col=redgreen(75),
> > > > scale="row", trace="none", key=F)
> > > >
> > > > ## Plot a separate heatmap for each data set using R's image() function
> > > > index <- matrix(1:505, nrow=5, ncol=101, byrow=T)[,-101]
> > > > par(mfrow=c(1,5))
> > > > for(i in 1:5) {
> > > >        image(scale(t(imp[rev(index[i,]), ])), col=redgreen(75),
> > xaxt="n",
> > > > yaxt="n", main=i)
> > > > }
> > > >
> > > > ## Example for hierarchical clustering of first data set using hclust
> > > > y <- imp[1:100, ] # Selects first 100 rows (=1st data set).
> > > > d <- as.dist(1-cor(t(y))) # Creates a distance matrix using Pearson
> > > > correlations; here you
> > > >                          # want to choose a distance measure that is
> > best
> > > > for your data.
> > > > hr <- hclust(d, method="complete") # Performs hierarchical clustering
> > > > heatmap.2(y, Rowv=as.dendrogram(hr), dendrogram="row", Colv=F,
> > > > col=redgreen(75), scale="row", trace="none")
> > > >
> > > >
> > > >
> > > > On Sat, Dec 12, 2009 at 11:25:05PM +0100, Maxim wrote:
> > > > > Hi,
> > > > >
> > > > > At first: thanks for taking the time!!
> > > > >
> > > > > I could send some of the data and a jpeg that illustates where I
> > would
> > > > like
> > > > > to go to. Unfortunately the data is large. It is genomics data
> > measuring
> > > > > binding of several factors to specific genomic regions. I'd like to
> > > > identify
> > > > > clusters where different factors show similar binding behaviour.
> > > > >
> > > > > My current dataset has data for 10 factors. I'm looking at 7000
> > "sites",
> > > > > each represented by 40 datapoints (that is in 100bp steps from
> > position
> > > > > -2000 to +2000 relative to the sites). Each site represents a certain
> > > > > genomic location and I look at the same sites for every factor.
> > > > >
> > > > > I wonder if this can be done straightforward in R?
> > > > >
> > > > > I attached an example of the data. It is the first 100 "sites" for 5
> > > > factors
> > > > > and the corresponding heatmap (for the complete set) after external
> > > > > clustering.
> > > > >
> > > > > Concerning doing the clustering in R: I have no clue how to do
> > clustering
> > > > > with such "mutli-dimensional" data within R. I'd be glad in case you
> > > > could
> > > > > point me at the right direction how to approach such a task. The
> > > > approaches
> > > > > in the literature appear to be quite complex and need lots of CPU
> > power
> > > > and
> > > > > are scripted in C, I guess for speed reasons. I wonder whether R is
> > fast
> > > > > enough to accomplish such a task in a reasonable time.
> > > > >
> > > > > To explain the data:
> > > > > It is 5 small tab-delimited files, each with data from 100 "sites"
> > for 5
> > > > > factors. The example data corresponds to the bottom of the attached
> > > > heatmap,
> > > > > so it is having clearly positive signals (blue color) at the center
> > > > position
> > > > > for factor 1 and factor 2.
> > > > >
> > > > > I hope this might help to illustrate my project a little better.
> > > > >
> > > > > Maxim
> > > > >
> > > > >
> > > > >
> > > > > 2009/12/11 Thomas Girke <thomas.girke at ucr.edu>
> > > > >
> > > > > > The example I send before works but there was a typo in the
> > heatmap.2
> > > > > > command
> > > > > > where mysort needs to replaced by mydata. Like this:
> > > > > >
> > > > > > heatmap.2(mydata, dendrogram="none", Rowv=F, Colv=F, trace="none",
> > > > key=T,
> > > > > > cellnote=mysamples)
> > > > > >
> > > > > > In heatmap.2 you have the option to include a color bar on the left
> > > > side
> > > > > > that
> > > > > > can be used to highlight clusters. See the help documentation for
> > more
> > > > > > details.
> > > > > >
> > > > > > The dimensions of heatmap.2 plots can be controlled like for any
> > other
> > > > plot
> > > > > > in
> > > > > > R, using the hight and width arguments, e.g. x11(height=6, width=2)
> > or
> > > > > > pdf(...).
> > > > > >
> > > > > > To provide more specific help, you may want to send a simple sample
> > > > data
> > > > > > set that
> > > > > > illustrates what you are trying to do exactly. Without this it is
> > > > really
> > > > > > hard
> > > > > > to understand your problem.
> > > > > >
> > > > > > If I were you then I would perform the entire clustering procedure
> > in
> > > > R. Is
> > > > > > there any
> > > > > > good reason not use R for this? For hierarchical clustering you can
> > use
> > > > the
> > > > > > hclust function.
> > > > > > A relatively complete list of clustering algorithms available in R
> > can
> > > > be
> > > > > > found on
> > > > > > the cluster task view page:
> > > > > > http://cran.at.r-project.org/web/views/Cluster.html
> > > > > >
> > > > > > Thomas
> > > > > >
> > > > > > On Fri, Dec 11, 2009 at 10:17:22PM +0100, Maxim wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > what you suggested sounds interesting but actually I do not
> > > > understand
> > > > > > where
> > > > > > > it is going to. Actually I'll get an error doing it like this as
> > > > mysort
> > > > > > in
> > > > > > > heatmap.2 is not defined (yet).
> > > > > > >
> > > > > > > What did work for me in meantime, is simply to plot the complete
> > > > heatmap
> > > > > > at
> > > > > > > once. What is missing in this approach is a label on the left or
> > > > right
> > > > > > side
> > > > > > > of the heatmap, I'd love to have a colorcoded block that allows
> > me to
> > > > > > see,
> > > > > > > where different IDs were plotted (different IDs are actually
> > clusters
> > > > > > coming
> > > > > > > from hierarchical clustering not performed in R).
> > > > > > >
> > > > > > > Second I can do it by generating individual heatmaps for each ID
> > > > loaded
> > > > > > from
> > > > > > > individual files, unfortunately for some IDs there are thousand
> > rows
> > > > of
> > > > > > > data, for others only 50. But R's heatmap always produces
> > similarly
> > > > sized
> > > > > > > maps. I'd prefer to have the height of the individual heatmaps
> > > > according
> > > > > > to
> > > > > > > the corresponding number of rows rather than automatic scaling.
> > > > > > >
> > > > > > > Is there a way to do this in R? I found an old mail in the
> > mailing
> > > > list
> > > > > > > discussing this point, the result was to use TreeView/Cluster,
> > but I
> > > > > > cannot
> > > > > > > get this to work without doing clustering (the data is clustered
> > > > > > already),
> > > > > > > additionally I do not know how to do batch processing in
> > TreeView.
> > > > > > >
> > > > > > > Maxim
> > > > > > > 2009/12/10 Thomas Girke <thomas.girke at ucr.edu>
> > > > > > >
> > > > > > > > I am not sure if I understand every part of your problem
> > correctly,
> > > > > > > > but here is an example how something like this could be done in
> > R.
> > > > > > > > Its main idea is to keep the entire data set in one matrix and
> > use
> > > > > > > > the cell note feature of heatmap.2 for sample tracking.
> > > > > > > >
> > > > > > > > ## Sample matrix for demo purpose. If your
> > > > > > > > y <- matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10,
> > > > sep=""),
> > > > > > > > paste("t", 1:5, sep="")))
> > > > > > > >
> > > > > > > > ## Sort each row by its values
> > > > > > > > mydata <- t(apply(y, 1, sort))
> > > > > > > >
> > > > > > > > ## Obtain sample labels (column titles) for sorted rows
> > > > > > > > mysamples <-  t(apply(y, 1, function(x) names(sort(x))))
> > > > > > > >
> > > > > > > > ## Plot heatmap where the sample labels are given as cell notes
> > for
> > > > > > > > tracking purposes
> > > > > > > > library(gplots)
> > > > > > > > heatmap.2(mydata, dendrogram="none", Rowv=F, Colv=F,
> > > > col=redgreen(75),
> > > > > > > > scale="row", trace="none", key=T, cellnote=mysamples)
> > > > > > > >
> > > > > > > > Thomas
> > > > > > > >
> > > > > > > >
> > > > > > > > On Thu, Dec 10, 2009 at 03:13:31PM +0100, Maxim wrote:
> > > > > > > > > Hi,
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > I'm stuck with parsing data into R for heatmap
> > representation.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > The data looks like:
> > > > > > > > >
> > > > > > > > > 1 id1 x1 x2 x3 .... x20
> > > > > > > > >
> > > > > > > > > 2 id1 x1 x2 x3 .... x20
> > > > > > > > >
> > > > > > > > > 3 id1 x1 x2 x3 .... x20
> > > > > > > > >
> > > > > > > > > 4 id1 x1 x2 x3 .... x20
> > > > > > > > >
> > > > > > > > > .........
> > > > > > > > >
> > > > > > > > > 348 id2 x1 x2 x3 .... x20
> > > > > > > > >
> > > > > > > > > 349 id2 x1 x2 x3 .... x20
> > > > > > > > >
> > > > > > > > > 350 id2 x1 x2 x3 .... x20
> > > > > > > > >
> > > > > > > > > 351 id2 x1 x2 x3 .... x20
> > > > > > > > >
> > > > > > > > > .........
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > The data is sorted for the IDs (id1,id2 .....id40) and I like
> > to
> > > > > > produce
> > > > > > > > 40
> > > > > > > > > heatmaps thereof, 1 heatmap per data corresponding to a
> > single
> > > > ID.
> > > > > >  The
> > > > > > > > data
> > > > > > > > > that has to be plotted is 20 values (x1 to x20). There is
> > > > different
> > > > > > > > amounts
> > > > > > > > > of data for respective IDs. In the end I'd like to have the
> > 40
> > > > > > heatmaps
> > > > > > > > > stacked on top of each other sorted by ID and heatmap heights
> > > > > > according
> > > > > > > > to
> > > > > > > > > the amount (number of rows) of data. Unfortunately the
> > individual
> > > > > > data
> > > > > > > > lines
> > > > > > > > > have to be sorted with respect to the maximum of the values
> > X1 to
> > > > x20
> > > > > > in
> > > > > > > > > individual rows. Actually this not that important as I guess
> > this
> > > > > > might
> > > > > > > > be
> > > > > > > > > easier to realize in upstream Perl scripts producing the
> > data.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > The data is available as data per ID in individual files or
> > as a
> > > > > > sorted
> > > > > > > > file
> > > > > > > > > with the complete dataset (as shown above).
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Is it possible in R to break a file as above into distinct
> > blocks
> > > > > > > > (depending
> > > > > > > > > on ID) and then to process it (sorting according to maximum,
> > > > > > heatmap)?
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Which commands do I have to issue for the manipulation of the
> > > > > > data.frame?
> > > > > > > > I
> > > > > > > > > tried the
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > I'd be glad  if someone could help me finding the correct
> > > > direction
> > > > > > to
> > > > > > > > solve
> > > > > > > > > my problem!
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Best regards
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Maxim
> > > > > > > > >
> > > > > > > > >       [[alternative HTML version deleted]]
> > > > > > > > >
> > > > > > > > > _______________________________________________
> > > > > > > > > Bioconductor mailing list
> > > > > > i> > > Bioconductor at stat.math.ethz.ch
> > > > > > > > > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > > > > > > > > Search the archives:
> > > > > > > >
> > http://news.gmane.org/gmane.science.biology.informatics.conductor
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > > >
> > > >
> >



More information about the Bioconductor mailing list