[BioC] problem with data processing in R

Sun Dec 13 19:30:08 CET 2009

Great. - The remaining parts of your analysis seem to be solvable
with simple subsetting and sorting routines of data objects in R. 

Here are some more suggestions that might help you here:

## Accessing data components of hclust objects:
names(hr)
hr$labels; hr$order

## How to return object labels in the order of a hierarchical clustering result:
hr$labels[hr$order]

## Sorting vector to join rows of same type in a data matrix (e.g. your factorX data sets)
sortv <- as.vector(matrix(1:500, nrow=5, ncol=100, byrow=T))
somematrix[sortv, ] 

Thomas

On Sun, Dec 13, 2009 at 11:38:54AM +0100, Maxim wrote:
> Thank you,
> 
> that works great. I was not aware of the "image" function. To do scaling
> helped definitely o improve the quality of the heatmaps. In addition I
> learned a lot about handling/importing data.
> 
> I have to think about the clustering itself. Your suggestions works fine to
> cluster 1 dataset. Unfortunately clustering has to be performed comparing
> all 5 datasets with each other. That means a given row in dataset 1 has to
> be aligned with the same rows in datasets 2-5. When I perform clustering of
> the individual datasets each set will be clustered differently.
> 
> Despite of this you definitely provided a good starting point to better
> understand how to exploit R for my project.
> 
> Maxim
> 
> 2009/12/13 Thomas Girke <thomas.girke at ucr.edu>
> 
> > Below are some suggestions for importing your data into R, plotting
> > heatmaps and clustering them using hierarchical clustering. For the
> > clustering, you have to choose a proper distance measure for your data
> > type and of course an efficient partitioning algorithm. The task view
> > site I sent previously provides a good overview as to what is available.
> > Clustering of 7000 objects in R is not a problem for most clustering
> > methods. Many of them, including hclust, are implemented in C++/Fortran
> > to run decently time and memory efficient.
> >
> > I hope this will help to get you started.
> >
> > Thomas
> >
> >
> > ## Import of data sets (appended in one data frame)
> > ## Note: a row of NA values is inserted to visually separate data sets in
> > heatmap
> > filenames <- list.files(pattern="factor*") # Requires data files "factorX"
> > in working dir
> > impDF <- data.frame(NULL)
> > for(i in filenames) {
> >        tmp <- read.delim(i, header=FALSE)
> >        impDF <- rbind(impDF, tmp, rep(NA, dim(tmp)[2]))
> > }
> > myrownames <- paste(1:length(impDF[,1]), impDF[,2], impDF[,3], sep="_")
> > rownames(impDF) <- myrownames
> > impDF <- impDF[,4:44]
> > imp <- as.matrix(impDF)
> >
> > ## Alternative import into list container (not used in following example)
> > filenames <- list.files(pattern="factor*")
> > datalist <- lapply(filenames, function(x) read.delim(x, header=F))
> > names(datalist) <- filenames
> >
> > ## Plot heatmap with heatmap.2
> > library(gplots)
> > heatmap.2(imp, dendrogram="none", Rowv=F, Colv=F, col=redgreen(75),
> > scale="row", trace="none", key=F)
> >
> > ## Plot a separate heatmap for each data set using R's image() function
> > index <- matrix(1:505, nrow=5, ncol=101, byrow=T)[,-101]
> > par(mfrow=c(1,5))
> > for(i in 1:5) {
> >        image(scale(t(imp[rev(index[i,]), ])), col=redgreen(75), xaxt="n",
> > yaxt="n", main=i)
> > }
> >
> > ## Example for hierarchical clustering of first data set using hclust
> > y <- imp[1:100, ] # Selects first 100 rows (=1st data set).
> > d <- as.dist(1-cor(t(y))) # Creates a distance matrix using Pearson
> > correlations; here you
> >                          # want to choose a distance measure that is best
> > for your data.
> > hr <- hclust(d, method="complete") # Performs hierarchical clustering
> > heatmap.2(y, Rowv=as.dendrogram(hr), dendrogram="row", Colv=F,
> > col=redgreen(75), scale="row", trace="none")
> >
> >
> >
> > On Sat, Dec 12, 2009 at 11:25:05PM +0100, Maxim wrote:
> > > Hi,
> > >
> > > At first: thanks for taking the time!!
> > >
> > > I could send some of the data and a jpeg that illustates where I would
> > like
> > > to go to. Unfortunately the data is large. It is genomics data measuring
> > > binding of several factors to specific genomic regions. I'd like to
> > identify
> > > clusters where different factors show similar binding behaviour.
> > >
> > > My current dataset has data for 10 factors. I'm looking at 7000 "sites",
> > > each represented by 40 datapoints (that is in 100bp steps from position
> > > -2000 to +2000 relative to the sites). Each site represents a certain
> > > genomic location and I look at the same sites for every factor.
> > >
> > > I wonder if this can be done straightforward in R?
> > >
> > > I attached an example of the data. It is the first 100 "sites" for 5
> > factors
> > > and the corresponding heatmap (for the complete set) after external
> > > clustering.
> > >
> > > Concerning doing the clustering in R: I have no clue how to do clustering
> > > with such "mutli-dimensional" data within R. I'd be glad in case you
> > could
> > > point me at the right direction how to approach such a task. The
> > approaches
> > > in the literature appear to be quite complex and need lots of CPU power
> > and
> > > are scripted in C, I guess for speed reasons. I wonder whether R is fast
> > > enough to accomplish such a task in a reasonable time.
> > >
> > > To explain the data:
> > > It is 5 small tab-delimited files, each with data from 100 "sites" for 5
> > > factors. The example data corresponds to the bottom of the attached
> > heatmap,
> > > so it is having clearly positive signals (blue color) at the center
> > position
> > > for factor 1 and factor 2.
> > >
> > > I hope this might help to illustrate my project a little better.
> > >
> > > Maxim
> > >
> > >
> > >
> > > 2009/12/11 Thomas Girke <thomas.girke at ucr.edu>
> > >
> > > > The example I send before works but there was a typo in the heatmap.2
> > > > command
> > > > where mysort needs to replaced by mydata. Like this:
> > > >
> > > > heatmap.2(mydata, dendrogram="none", Rowv=F, Colv=F, trace="none",
> > key=T,
> > > > cellnote=mysamples)
> > > >
> > > > In heatmap.2 you have the option to include a color bar on the left
> > side
> > > > that
> > > > can be used to highlight clusters. See the help documentation for more
> > > > details.
> > > >
> > > > The dimensions of heatmap.2 plots can be controlled like for any other
> > plot
> > > > in
> > > > R, using the hight and width arguments, e.g. x11(height=6, width=2) or
> > > > pdf(...).
> > > >
> > > > To provide more specific help, you may want to send a simple sample
> > data
> > > > set that
> > > > illustrates what you are trying to do exactly. Without this it is
> > really
> > > > hard
> > > > to understand your problem.
> > > >
> > > > If I were you then I would perform the entire clustering procedure in
> > R. Is
> > > > there any
> > > > good reason not use R for this? For hierarchical clustering you can use
> > the
> > > > hclust function.
> > > > A relatively complete list of clustering algorithms available in R can
> > be
> > > > found on
> > > > the cluster task view page:
> > > > http://cran.at.r-project.org/web/views/Cluster.html
> > > >
> > > > Thomas
> > > >
> > > > On Fri, Dec 11, 2009 at 10:17:22PM +0100, Maxim wrote:
> > > > > Hi,
> > > > >
> > > > > what you suggested sounds interesting but actually I do not
> > understand
> > > > where
> > > > > it is going to. Actually I'll get an error doing it like this as
> > mysort
> > > > in
> > > > > heatmap.2 is not defined (yet).
> > > > >
> > > > > What did work for me in meantime, is simply to plot the complete
> > heatmap
> > > > at
> > > > > once. What is missing in this approach is a label on the left or
> > right
> > > > side
> > > > > of the heatmap, I'd love to have a colorcoded block that allows me to
> > > > see,
> > > > > where different IDs were plotted (different IDs are actually clusters
> > > > coming
> > > > > from hierarchical clustering not performed in R).
> > > > >
> > > > > Second I can do it by generating individual heatmaps for each ID
> > loaded
> > > > from
> > > > > individual files, unfortunately for some IDs there are thousand rows
> > of
> > > > > data, for others only 50. But R's heatmap always produces similarly
> > sized
> > > > > maps. I'd prefer to have the height of the individual heatmaps
> > according
> > > > to
> > > > > the corresponding number of rows rather than automatic scaling.
> > > > >
> > > > > Is there a way to do this in R? I found an old mail in the mailing
> > list
> > > > > discussing this point, the result was to use TreeView/Cluster, but I
> > > > cannot
> > > > > get this to work without doing clustering (the data is clustered
> > > > already),
> > > > > additionally I do not know how to do batch processing in TreeView.
> > > > >
> > > > > Maxim
> > > > > 2009/12/10 Thomas Girke <thomas.girke at ucr.edu>
> > > > >
> > > > > > I am not sure if I understand every part of your problem correctly,
> > > > > > but here is an example how something like this could be done in R.
> > > > > > Its main idea is to keep the entire data set in one matrix and use
> > > > > > the cell note feature of heatmap.2 for sample tracking.
> > > > > >
> > > > > > ## Sample matrix for demo purpose. If your
> > > > > > y <- matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10,
> > sep=""),
> > > > > > paste("t", 1:5, sep="")))
> > > > > >
> > > > > > ## Sort each row by its values
> > > > > > mydata <- t(apply(y, 1, sort))
> > > > > >
> > > > > > ## Obtain sample labels (column titles) for sorted rows
> > > > > > mysamples <-  t(apply(y, 1, function(x) names(sort(x))))
> > > > > >
> > > > > > ## Plot heatmap where the sample labels are given as cell notes for
> > > > > > tracking purposes
> > > > > > library(gplots)
> > > > > > heatmap.2(mydata, dendrogram="none", Rowv=F, Colv=F,
> > col=redgreen(75),
> > > > > > scale="row", trace="none", key=T, cellnote=mysamples)
> > > > > >
> > > > > > Thomas
> > > > > >
> > > > > >
> > > > > > On Thu, Dec 10, 2009 at 03:13:31PM +0100, Maxim wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > >
> > > > > > > I'm stuck with parsing data into R for heatmap representation.
> > > > > > >
> > > > > > >
> > > > > > > The data looks like:
> > > > > > >
> > > > > > > 1 id1 x1 x2 x3 .... x20
> > > > > > >
> > > > > > > 2 id1 x1 x2 x3 .... x20
> > > > > > >
> > > > > > > 3 id1 x1 x2 x3 .... x20
> > > > > > >
> > > > > > > 4 id1 x1 x2 x3 .... x20
> > > > > > >
> > > > > > > .........
> > > > > > >
> > > > > > > 348 id2 x1 x2 x3 .... x20
> > > > > > >
> > > > > > > 349 id2 x1 x2 x3 .... x20
> > > > > > >
> > > > > > > 350 id2 x1 x2 x3 .... x20
> > > > > > >
> > > > > > > 351 id2 x1 x2 x3 .... x20
> > > > > > >
> > > > > > > .........
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > The data is sorted for the IDs (id1,id2 .....id40) and I like to
> > > > produce
> > > > > > 40
> > > > > > > heatmaps thereof, 1 heatmap per data corresponding to a single
> > ID.
> > > >  The
> > > > > > data
> > > > > > > that has to be plotted is 20 values (x1 to x20). There is
> > different
> > > > > > amounts
> > > > > > > of data for respective IDs. In the end I'd like to have the 40
> > > > heatmaps
> > > > > > > stacked on top of each other sorted by ID and heatmap heights
> > > > according
> > > > > > to
> > > > > > > the amount (number of rows) of data. Unfortunately the individual
> > > > data
> > > > > > lines
> > > > > > > have to be sorted with respect to the maximum of the values X1 to
> > x20
> > > > in
> > > > > > > individual rows. Actually this not that important as I guess this
> > > > might
> > > > > > be
> > > > > > > easier to realize in upstream Perl scripts producing the data.
> > > > > > >
> > > > > > >
> > > > > > > The data is available as data per ID in individual files or as a
> > > > sorted
> > > > > > file
> > > > > > > with the complete dataset (as shown above).
> > > > > > >
> > > > > > >
> > > > > > > Is it possible in R to break a file as above into distinct blocks
> > > > > > (depending
> > > > > > > on ID) and then to process it (sorting according to maximum,
> > > > heatmap)?
> > > > > > >
> > > > > > >
> > > > > > > Which commands do I have to issue for the manipulation of the
> > > > data.frame?
> > > > > > I
> > > > > > > tried the
> > > > > > >
> > > > > > >
> > > > > > > I'd be glad  if someone could help me finding the correct
> > direction
> > > > to
> > > > > > solve
> > > > > > > my problem!
> > > > > > >
> > > > > > >
> > > > > > > Best regards
> > > > > > >
> > > > > > >
> > > > > > > Maxim
> > > > > > >
> > > > > > >       [[alternative HTML version deleted]]
> > > > > > >
> > > > > > > _______________________________________________
> > > > > > > Bioconductor mailing list
> > > > i> > > Bioconductor at stat.math.ethz.ch
> > > > > > > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > > > > > > Search the archives:
> > > > > > http://news.gmane.org/gmane.science.biology.informatics.conductor
> > > > > > >
> > > > > >
> > > >
> >
> >
> >