[R] data mining for R
Tony Plate
tplate at blackmesacapital.com
Thu Sep 5 19:02:33 CEST 2002
At 04:16 PM 9/5/2002 +0200, "Philippe Grosjean" <phgrosje at ulb.ac.be> wrote:
>At the risk of being heavily criticized, one could mainly see data mining as a
>pseudo-new concept invented to sell new (and sometimes expensive) software
>to industries. Data mining is nothing more than existing statistical
>analyses optimized for speed in order to deal with millions of entries, or
>even more, in a reasonable period of time.
Although not an unreasonable position, there are ways in which this is far
from true. One huge difference between data mining and statistical
analysis is that in practice, most of the time and effort in data mining is
spent getting the data into shape. The sheer volume of data involved
necessitates some sort of automated tools (which are sometimes based on
statistics). These tools are often custom built for the project. The
statistical analysis is usually the easy part of data mining, and often the
most appropriate statistical techniques are utterly unnovel ones like linear
regression. (This is not to say that ensuring clean data is not an
essential task in "traditional" statistical analysis, but the scale of the
problem is usually quite different.)
This situation creates great opportunities for data-mining tools to
help. However, it is an extremely difficult problem because there are so
many ways that raw data can be "wrong". For example, most real-world large
databases have some problems with IDs, which often makes matching and
assembling data records difficult and error prone (e.g., no unique IDs
available, people's names are misspelt, addresses change, different product
IDs used in different organizations, publicly traded companies change
their ticker symbols and die and merge and spin off). Dealing with these
kinds of problems usually requires much domain knowledge -- I don't know of
any general-purpose data-mining tools that can automatically fix ID-related
problems. Data value errors are another problem. For example, the process
of getting option prices into electronic databases sometimes involves a
step where humans transcribe numbers manually, and humans are prone to
transpose digits occasionally. It's pretty easy to tell that in the midst of
$91 prices a price of "$19.31" is an error, but is a price of "$7.47"
occurring in the midst of $7.70 prices a data error or a large but true
deviation? Data value errors are generally easier to deal with than
ID-related problems, but custom approaches are still often necessary to
incorporate valuable domain knowledge.
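
To make the transposed-digit example concrete, here is a minimal sketch in
Python. It is not part of any existing tool: the function names, the 2%
tolerance, and the neighbouring prices are all assumptions made up for
illustration. It swaps each pair of adjacent digits in a suspect price and
reports the swapped value that lands closest to the median of the
surrounding prices, if any lands within the tolerance.

```python
from statistics import median


def adjacent_swaps(price_text):
    """Return every value obtained by swapping two adjacent digits.

    Assumes price_text contains only digits and at most one decimal
    point (e.g. "19.31", no currency symbol). The decimal point stays
    where it is; only the digits move.
    """
    digits = [c for c in price_text if c.isdigit()]
    dot = price_text.find(".")
    candidates = []
    for i in range(len(digits) - 1):
        if digits[i] == digits[i + 1]:
            continue  # swapping equal digits changes nothing
        swapped = digits[:]
        swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
        text = "".join(swapped)
        if dot != -1:
            text = text[:dot] + "." + text[dot:]
        candidates.append(float(text))
    return candidates


def likely_transposition(price_text, neighbours, tolerance=0.02):
    """Return the swapped value closest to the neighbours' median,
    or None if no swap lands within the relative tolerance.

    The 2% default tolerance is an arbitrary illustrative choice.
    """
    centre = median(neighbours)
    best = None
    for value in adjacent_swaps(price_text):
        if abs(value - centre) <= tolerance * centre:
            if best is None or abs(value - centre) < abs(best - centre):
                best = value
    return best


# Made-up neighbouring prices, chosen to mirror the two cases above.
print(likely_transposition("19.31", [91.0, 91.25, 90.75]))  # 91.31
print(likely_transposition("7.47", [7.70, 7.72, 7.68]))     # 7.74
```

Both prices come back flagged. The swap test alone cannot say whether
"$7.47" is a mistyped $7.74 or a genuine move, so a hit only marks a
candidate for review; deciding what to do with it still takes domain
knowledge.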
-- Tony Plate