[R] data mining for R

Tony Plate tplate at blackmesacapital.com
Thu Sep 5 19:02:33 CEST 2002


At 04:16 PM 9/5/2002 +0200, "Philippe Grosjean" <phgrosje at ulb.ac.be> wrote:
>At the risk of being heavily criticized, one could mainly see data mining as a
>pseudo-new concept invented to sell new (and sometimes, expensive) software
>to industries. Data mining is nothing else than existing statistical
>analyses optimized for speed in order to deal with millions of entries, or
>even more, in a reasonable period of time.

Although that is not an unreasonable position, there are ways in which it
is far from true.  One huge difference between data mining and statistical
analysis is that in practice, most of the time and effort in data mining is
spent getting the data into shape.  The sheer volume of data involved
necessitates some sort of automated tools (which are sometimes based on
statistics).  These tools are often custom built for the project.  The
statistics is usually the easy part of data mining, and often the most
appropriate statistical techniques are thoroughly familiar ones like linear
regression.  (This is not to say that ensuring clean data is not an
essential task in "traditional" statistical analysis, but the scale of the
problem is usually quite different.)

This situation creates great opportunities for data-mining tools to
help.  However, it is an extremely difficult problem because there are so
many ways that raw data can be "wrong".  For example, most real-world large
databases have some problems with IDs, which often makes matching and
assembling data records difficult and error-prone (e.g., no unique IDs are
available, people's names are misspelt, addresses change, different product
IDs are used in different organizations, and publicly traded companies
change their ticker symbols, die, merge, and spin off).  Dealing with these
kinds of problems usually requires much domain knowledge -- I don't know of
any general-purpose data-mining tools that can automatically fix ID-related
problems.  Data value errors are another problem.  For example, the process
of getting option prices into electronic databases sometimes involves a
step where humans transcribe numbers manually, and humans are prone to
transpose digits occasionally.  It's pretty easy to tell that in the midst
of $91 prices a price of "$19.31" is an error, but is a price of "$7.47"
occurring in the midst of $7.70 prices a data error or a large but true
deviation?  Data value errors are generally easier to deal with than
ID-related problems, but custom approaches are still often necessary to
incorporate valuable domain knowledge.
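
To make the digit-transposition example concrete, here is a minimal sketch
in Python of one possible custom check.  It is only an illustration, not a
description of any particular tool: the function names, the neighbour
window, and the tolerance are all assumptions, and it assumes prices with
two decimal places.  A price is flagged when it is far from the median of
its neighbours but swapping two adjacent digits brings it close.

```python
from statistics import median


def swap_adjacent_digits(price):
    """Return the prices obtained by swapping two adjacent digits.

    Hypothetical helper: works on the price expressed in whole cents.
    """
    digits = f"{round(price * 100):d}"
    swapped = set()
    for i in range(len(digits) - 1):
        if digits[i] != digits[i + 1]:
            s = digits[:i] + digits[i + 1] + digits[i] + digits[i + 2:]
            swapped.add(int(s) / 100)
    return swapped


def flag_transpositions(prices, window=2, tol=0.05):
    """Return (index, price, suggested price) for suspected transpositions.

    window and tol are assumed tuning knobs: how many neighbours to look
    at on each side, and the relative distance from the local median
    that still counts as "close".
    """
    flags = []
    for i, p in enumerate(prices):
        neighbours = prices[max(0, i - window):i] + prices[i + 1:i + 1 + window]
        if not neighbours:
            continue
        ref = median(neighbours)
        if abs(p - ref) <= tol * ref:
            continue  # price already agrees with its neighbours
        close = [c for c in swap_adjacent_digits(p) if abs(c - ref) <= tol * ref]
        if close:
            flags.append((i, p, min(close, key=lambda c: abs(c - ref))))
    return flags


print(flag_transpositions([91.10, 91.25, 19.31, 91.40, 91.20]))
print(flag_transpositions([7.70, 7.72, 7.47, 7.69, 7.71]))
print(flag_transpositions([7.70, 7.72, 7.47, 7.69, 7.71], tol=0.01))
```

With the default 5% tolerance the $19.31 price is flagged (with 91.31 as
the suggested value) and the $7.47 price is not.  With a 1% tolerance the
$7.47 price is flagged too, but the check cannot say whether it is really
an error; choosing the tolerance is exactly where the domain knowledge
comes in.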

-- Tony Plate
