[BioC] problem with data processing in R

Sun Dec 13 02:43:07 CET 2009

Below are some suggestions for importing your data into R, plotting
heatmaps and clustering them using hierarchical clustering. For the
clustering, you have to choose a proper distance measure for your data
type and of course an efficient partitioning algorithm. The task view
site I sent previously provides a good overview as to what is available.
Clustering of 7000 objects in R is not a problem for most clustering
methods. Many of them, including hclust, are implemented in C++/Fortran
to run decently time and memory efficient. 

I hope this will help to get you started.

Thomas

## Import of data sets (appended in one data frame)
## Note: a row of NA values is inserted to visually separate data sets in heatmap
filenames <- list.files(pattern="factor*") # Requires data files "factorX" in working dir 
impDF <- data.frame(NULL) 
for(i in filenames) { 
	tmp <- read.delim(i, header=FALSE)
       	impDF <- rbind(impDF, tmp, rep(NA, dim(tmp)[2]))  
} 
myrownames <- paste(1:length(impDF[,1]), impDF[,2], impDF[,3], sep="_")
rownames(impDF) <- myrownames
impDF <- impDF[,4:44]
imp <- as.matrix(impDF)

## Alternative import into list container (not used in following example)
filenames <- list.files(pattern="factor*")
datalist <- lapply(filenames, function(x) read.delim(x, header=F))
names(datalist) <- filenames

## Plot heatmap with heatmap.2
library(gplots)
heatmap.2(imp, dendrogram="none", Rowv=F, Colv=F, col=redgreen(75), scale="row", trace="none", key=F)

## Plot a separate heatmap for each data set using R's image() function 
index <- matrix(1:505, nrow=5, ncol=101, byrow=T)[,-101] 
par(mfrow=c(1,5))
for(i in 1:5) { 
	image(scale(t(imp[rev(index[i,]), ])), col=redgreen(75), xaxt="n", yaxt="n", main=i) 
}

## Example for hierarchical clustering of first data set using hclust
y <- imp[1:100, ] # Selects first 100 rows (=1st data set).
d <- as.dist(1-cor(t(y))) # Creates a distance matrix using Pearson correlations; here you 
                          # want to choose a distance measure that is best for your data.
hr <- hclust(d, method="complete") # Performs hierarchical clustering
heatmap.2(y, Rowv=as.dendrogram(hr), dendrogram="row", Colv=F, col=redgreen(75), scale="row", trace="none")

On Sat, Dec 12, 2009 at 11:25:05PM +0100, Maxim wrote:
> Hi,
> 
> At first: thanks for taking the time!!
> 
> I could send some of the data and a jpeg that illustates where I would like
> to go to. Unfortunately the data is large. It is genomics data measuring
> binding of several factors to specific genomic regions. I'd like to identify
> clusters where different factors show similar binding behaviour.
> 
> My current dataset has data for 10 factors. I'm looking at 7000 "sites",
> each represented by 40 datapoints (that is in 100bp steps from position
> -2000 to +2000 relative to the sites). Each site represents a certain
> genomic location and I look at the same sites for every factor.
> 
> I wonder if this can be done straightforward in R?
> 
> I attached an example of the data. It is the first 100 "sites" for 5 factors
> and the corresponding heatmap (for the complete set) after external
> clustering.
> 
> Concerning doing the clustering in R: I have no clue how to do clustering
> with such "mutli-dimensional" data within R. I'd be glad in case you could
> point me at the right direction how to approach such a task. The approaches
> in the literature appear to be quite complex and need lots of CPU power and
> are scripted in C, I guess for speed reasons. I wonder whether R is fast
> enough to accomplish such a task in a reasonable time.
> 
> To explain the data:
> It is 5 small tab-delimited files, each with data from 100 "sites" for 5
> factors. The example data corresponds to the bottom of the attached heatmap,
> so it is having clearly positive signals (blue color) at the center position
> for factor 1 and factor 2.
> 
> I hope this might help to illustrate my project a little better.
> 
> Maxim
> 
> 
> 
> 2009/12/11 Thomas Girke <thomas.girke at ucr.edu>
> 
> > The example I send before works but there was a typo in the heatmap.2
> > command
> > where mysort needs to replaced by mydata. Like this:
> >
> > heatmap.2(mydata, dendrogram="none", Rowv=F, Colv=F, trace="none", key=T,
> > cellnote=mysamples)
> >
> > In heatmap.2 you have the option to include a color bar on the left side
> > that
> > can be used to highlight clusters. See the help documentation for more
> > details.
> >
> > The dimensions of heatmap.2 plots can be controlled like for any other plot
> > in
> > R, using the hight and width arguments, e.g. x11(height=6, width=2) or
> > pdf(...).
> >
> > To provide more specific help, you may want to send a simple sample data
> > set that
> > illustrates what you are trying to do exactly. Without this it is really
> > hard
> > to understand your problem.
> >
> > If I were you then I would perform the entire clustering procedure in R. Is
> > there any
> > good reason not use R for this? For hierarchical clustering you can use the
> > hclust function.
> > A relatively complete list of clustering algorithms available in R can be
> > found on
> > the cluster task view page:
> > http://cran.at.r-project.org/web/views/Cluster.html
> >
> > Thomas
> >
> > On Fri, Dec 11, 2009 at 10:17:22PM +0100, Maxim wrote:
> > > Hi,
> > >
> > > what you suggested sounds interesting but actually I do not understand
> > where
> > > it is going to. Actually I'll get an error doing it like this as mysort
> > in
> > > heatmap.2 is not defined (yet).
> > >
> > > What did work for me in meantime, is simply to plot the complete heatmap
> > at
> > > once. What is missing in this approach is a label on the left or right
> > side
> > > of the heatmap, I'd love to have a colorcoded block that allows me to
> > see,
> > > where different IDs were plotted (different IDs are actually clusters
> > coming
> > > from hierarchical clustering not performed in R).
> > >
> > > Second I can do it by generating individual heatmaps for each ID loaded
> > from
> > > individual files, unfortunately for some IDs there are thousand rows of
> > > data, for others only 50. But R's heatmap always produces similarly sized
> > > maps. I'd prefer to have the height of the individual heatmaps according
> > to
> > > the corresponding number of rows rather than automatic scaling.
> > >
> > > Is there a way to do this in R? I found an old mail in the mailing list
> > > discussing this point, the result was to use TreeView/Cluster, but I
> > cannot
> > > get this to work without doing clustering (the data is clustered
> > already),
> > > additionally I do not know how to do batch processing in TreeView.
> > >
> > > Maxim
> > > 2009/12/10 Thomas Girke <thomas.girke at ucr.edu>
> > >
> > > > I am not sure if I understand every part of your problem correctly,
> > > > but here is an example how something like this could be done in R.
> > > > Its main idea is to keep the entire data set in one matrix and use
> > > > the cell note feature of heatmap.2 for sample tracking.
> > > >
> > > > ## Sample matrix for demo purpose. If your
> > > > y <- matrix(rnorm(50), 10, 5, dimnames=list(paste("g", 1:10, sep=""),
> > > > paste("t", 1:5, sep="")))
> > > >
> > > > ## Sort each row by its values
> > > > mydata <- t(apply(y, 1, sort))
> > > >
> > > > ## Obtain sample labels (column titles) for sorted rows
> > > > mysamples <-  t(apply(y, 1, function(x) names(sort(x))))
> > > >
> > > > ## Plot heatmap where the sample labels are given as cell notes for
> > > > tracking purposes
> > > > library(gplots)
> > > > heatmap.2(mydata, dendrogram="none", Rowv=F, Colv=F, col=redgreen(75),
> > > > scale="row", trace="none", key=T, cellnote=mysamples)
> > > >
> > > > Thomas
> > > >
> > > >
> > > > On Thu, Dec 10, 2009 at 03:13:31PM +0100, Maxim wrote:
> > > > > Hi,
> > > > >
> > > > >
> > > > > I'm stuck with parsing data into R for heatmap representation.
> > > > >
> > > > >
> > > > > The data looks like:
> > > > >
> > > > > 1 id1 x1 x2 x3 .... x20
> > > > >
> > > > > 2 id1 x1 x2 x3 .... x20
> > > > >
> > > > > 3 id1 x1 x2 x3 .... x20
> > > > >
> > > > > 4 id1 x1 x2 x3 .... x20
> > > > >
> > > > > .........
> > > > >
> > > > > 348 id2 x1 x2 x3 .... x20
> > > > >
> > > > > 349 id2 x1 x2 x3 .... x20
> > > > >
> > > > > 350 id2 x1 x2 x3 .... x20
> > > > >
> > > > > 351 id2 x1 x2 x3 .... x20
> > > > >
> > > > > .........
> > > > >
> > > > >
> > > > >
> > > > > The data is sorted for the IDs (id1,id2 .....id40) and I like to
> > produce
> > > > 40
> > > > > heatmaps thereof, 1 heatmap per data corresponding to a single ID.
> >  The
> > > > data
> > > > > that has to be plotted is 20 values (x1 to x20). There is different
> > > > amounts
> > > > > of data for respective IDs. In the end I'd like to have the 40
> > heatmaps
> > > > > stacked on top of each other sorted by ID and heatmap heights
> > according
> > > > to
> > > > > the amount (number of rows) of data. Unfortunately the individual
> > data
> > > > lines
> > > > > have to be sorted with respect to the maximum of the values X1 to x20
> > in
> > > > > individual rows. Actually this not that important as I guess this
> > might
> > > > be
> > > > > easier to realize in upstream Perl scripts producing the data.
> > > > >
> > > > >
> > > > > The data is available as data per ID in individual files or as a
> > sorted
> > > > file
> > > > > with the complete dataset (as shown above).
> > > > >
> > > > >
> > > > > Is it possible in R to break a file as above into distinct blocks
> > > > (depending
> > > > > on ID) and then to process it (sorting according to maximum,
> > heatmap)?
> > > > >
> > > > >
> > > > > Which commands do I have to issue for the manipulation of the
> > data.frame?
> > > > I
> > > > > tried the
> > > > >
> > > > >
> > > > > I'd be glad  if someone could help me finding the correct direction
> > to
> > > > solve
> > > > > my problem!
> > > > >
> > > > >
> > > > > Best regards
> > > > >
> > > > >
> > > > > Maxim
> > > > >
> > > > >       [[alternative HTML version deleted]]
> > > > >
> > > > > _______________________________________________
> > > > > Bioconductor mailing list
> > i> > > Bioconductor at stat.math.ethz.ch
> > > > > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > > > > Search the archives:
> > > > http://news.gmane.org/gmane.science.biology.informatics.conductor
> > > > >
> > > >
> >