[R] Data-mining using R
gisar at nus.edu.sg
Fri May 9 05:00:49 CEST 2003
Yes, all of this is possible in R, and more.
You might find the which() command helpful for subsetting. You could
write a simple function to automate this. For graphing facilities, see
plot(), par(), postscript() etc.
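For example, a minimal sketch of that workflow (the data frame and column names here are invented for illustration; the real ones will differ):

```r
# Invented toy data; substitute your own data frame and columns
set.seed(42)
d <- data.frame(x = rnorm(100), y = rnorm(100), score = runif(100))

idx <- which(d$score > 0.5)     # which() returns the matching row indices
d2  <- d[idx, ]                 # subset to just those rows

postscript("subset.ps")         # send the plots to a PostScript file
par(mfrow = c(1, 2))            # par() controls layout, margins, etc.
plot(d$x, d$y, main = "all rows")
plot(d2$x, d2$y, main = "score > 0.5")
dev.off()
```

Wrapping the subsetting and plotting steps in a small function would let you rerun the same filter on each dataset.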
In my opinion, it might not be worth the time and effort to save the data to
MySQL if you only want to perform a couple of queries. Plus, R has
excellent graphing facilities. If you really want to automate the
process, then a combination of Perl and gnuplot could also work well.
The choice depends on which software you are most comfortable with.
Another advantage of R is that it is an interactive language, so it is
great for exploratory analysis with minimal effort (unlike Excel, in
which you spend 90% of your time dragging the mouse and sorting the data).
See the Bioconductor project, which focuses on genomic and expression
data and has many great functions designed specifically for microarrays,
etc. I doubt you will be able to find such a vast collection of tools for
this kind of data anywhere else.
From: Fernando Henrique Ferraz Pereira da Rosa [mailto:mentus at gmx.de]
Sent: Friday, May 09, 2003 8:35 AM
To: r-help at stat.math.ethz.ch
Subject: [R] Data-mining using R
Is it possible to use R as a data-mining tool? Here's the problem
I've got. I have a couple of data sets consisting of results from a cDNA
microarray experiment - the details about the biology don't really
matter here, the same theory applies for any other data-mining task
(that's why I thought it'd be more appropriate to post this on r-user).
Each of these datasets consists of about 30000 rows by 20 to 30 columns.
Let's say that each row represents (very roughly speaking) a gene, and
the columns are details about its level of expression, reliability of
the measurement, coordinates, and so on.
The main objective here is to identify some genes (rows) according to
some criteria. To do that, I want to be able to
selectively filter the rows, graph some convenient variables, do some
further filtering, and so on.
Let me take a more concrete example to make myself clear. Let's
say that I load a given dataset into a data frame, namely expr1. This
data frame would have the fields expr1$name, expr1$expression,
expr1$reliability, expr1$x, expr1$y and so on, containing, for instance,
26000 rows. Now from these 26000 I'd like to select only those
satisfying expr1$expression > 2000 and expr1$reliability == 100, and plot
expr1$x against expr1$y for them. I'd then have a reduced version of
the first dataset. Let's say now that I want to narrow my filter even more,
selecting only (among the ones I have already selected) the ones where
expr1$x > 20.
This would be done many times and in different orders. I'd also like to
be able, among those 26000 rows, to take only the 100 whose expr1$x values
are the greatest. And so on, many times, until I found a satisfactory set
of genes.
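The successive filtering described above maps directly onto data frame indexing; here is a sketch, with a small randomly generated expr1 standing in for the real data (column names follow the example):

```r
# Toy stand-in for the real microarray data frame described above
set.seed(1)
expr1 <- data.frame(name        = paste0("gene", 1:26000),
                    expression  = runif(26000, 0, 4000),
                    reliability = sample(c(90, 100), 26000, replace = TRUE),
                    x           = runif(26000, 0, 50),
                    y           = runif(26000, 0, 50))

# First filter: expression > 2000 and reliability == 100
sub1 <- expr1[expr1$expression > 2000 & expr1$reliability == 100, ]
plot(sub1$x, sub1$y)           # graph x against y for the selection

# Narrow the filter further: keep only rows with x > 20
sub2 <- sub1[sub1$x > 20, ]

# Or take the 100 rows with the greatest x among all 26000
top100 <- expr1[order(expr1$x, decreasing = TRUE)[1:100], ]
```

Each step returns an ordinary data frame, so the filters can be chained in any order.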
What is the proper way to do that using R, if any? I've played a
little with data frames (I could, for instance, use expr1$name[expr1$x >
20] to get the names of those genes whose x > 20), but it seemed a little
clumsy. Should I keep manipulating the data frame directly, or should I
perhaps save it to a MySQL database and do the queries using
RMySQL? Or maybe there is a better option?
I know that the things I've said are pretty easy to implement
using, for instance, M$ Excel (I've seen them working in it). You just
select drop-down menus and filter the rows to your liking. But I really
would like to be able to accomplish this task using R and other open
source tools like MySQL, Perl, etc.
Thank you in advance,