[R] Data-mining using R

Fri May 9 04:27:46 CEST 2003

See www.bioconductor.org for one reasonably full-featured approach.

There are others (Rmaanova, etc.).

Fernando Henrique Ferraz Pereira da Rosa <mentus at gmx.de> writes:

>       Is it possible to use R as a data-mining tool? Here's the problem I've
> got. I have a couple of data sets consisting of results from a cDNA
> microarray experiment - the details about the biology don't really matter here, the
> same theory applies for any other data-mining task (that's why I thought it'd
> be more appropriate to post this on r-user).  Each of these datasets consists
> of about 30000 rows by 20 to 30 columns. Let's say that each row represents
> (very roughly speaking) a gene, and the columns are details about its level
> of expression, reliability of the measurement, coordinates and so on.
>       The main objective here is to identify some genes (rows) according to some
> criteria. In order to do that, I want to be able to selectively
> filter the rows, graph some convenient variables, do some further filtering
> and so on.
>       Let me take a more concrete example to make myself clear. Let's say
> that I load a given dataset on a dataframe, namely expr1. This dataframe would
> have the fields expr1$name, expr1$expression, expr1$reliability, expr1$x,
> expr1$y and so on, containing, for instance, 26000 rows. Now from these 26000 I'd
> like to select only those satisfying expr1$expression > 2000 and
> expr1$reliability == 100, and plot expr1$x versus expr1$y for them. I'd then have
> a reduced subset of the first dataset. Let's say now that I want to narrow my
> filter even more, selecting only (among the ones I have already selected) the
> ones where expr1$x > 20.
>       This would be done many times and in different orders. I'd like to be
> able to, among those 26000 rows, take only the 100 whose expr1$x values are the
> greatest. And so on, many times, until I found a set of suitable rows.
>       What is the proper way to do that using R, if any? I've played a
> little with data frames (I could for instance use expr1$name[expr1$x > 20] to get
> the names of those genes whose x > 20) but it seemed a little clumsy. Should
> I keep trying to manipulate the data frame directly, or should I perhaps save
> it in a MySQL database and do the queries using RMySQL? Or maybe there is a
> better option?
>       I know that these things I've described are pretty easy to implement using,
> for instance, M$ Excel (I've seen them working on it). You just select
> drop-down menus and filter the rows to your liking. But I really would like to be
> able to accomplish this task using R and other open source tools like MySQL,
> Perl, etc.
>       
>
> Thank you in advance,
>
>
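
For the filtering steps described in the quoted message, a minimal base-R
sketch might look like the following. The data frame below is simulated and
purely hypothetical; only the column names (name, expression, reliability,
x, y) and the thresholds come from the question, and the intermediate names
sel1, sel2 and top100 are made up for illustration.

```r
# Simulated stand-in for the poster's expr1 data frame (assumption:
# the real data would be read from a file, e.g. with read.table()).
set.seed(1)
n <- 26000
expr1 <- data.frame(
  name        = paste("gene", seq_len(n), sep = ""),
  expression  = runif(n, 0, 5000),
  reliability = sample(c(50, 75, 100), n, replace = TRUE),
  x           = runif(n, 0, 100),
  y           = runif(n, 0, 100),
  stringsAsFactors = FALSE
)

# Step 1: keep rows with expression > 2000 and reliability equal to 100.
# subset() evaluates the condition inside the data frame, so the
# column names can be used without the expr1$ prefix.
sel1 <- subset(expr1, expression > 2000 & reliability == 100)

# Scatter plot of the selected rows.
plot(sel1$x, sel1$y, xlab = "x", ylab = "y")

# Step 2: narrow the previous selection further.
sel2 <- subset(sel1, x > 20)

# The 100 rows of the full data set with the largest x values:
# order() gives the row permutation, of which the first 100 are kept.
top100 <- expr1[order(expr1$x, decreasing = TRUE)[1:100], ]
```

Each step yields an ordinary data frame, so filters can be chained in any
order, and sel2$name gives the names of the rows that survive both filters.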

-- 
A.J. Rossini rossini at u.washington.edu http://software.biostat.washington.edu/ 
Biostatistics, U Washington and Fred Hutchinson Cancer Research Center

FHCRC:Tu: 206-667-7025 (fax=4812)|Voicemail is pretty sketchy/use Email 
UW  : Th: 206-543-1044 (fax=3286)|Change last 4 digits of phone to FAX 
