[R] Analogues to my data and prediction problem
Ben Harrison
harb at student.unimelb.edu.au
Mon Aug 26 07:50:43 CEST 2013
Hello, I am quite a novice when it comes to predictive modelling, so I
would like to see where my particular problem lies in the spectrum of
problems that you have collectively seen in your experience.
Background: I have been handed a piece of software that uses a Kohonen
self-organising map (SOM) to analyse and predict data in which missing
values are common, but I want to compare its results with other forms of
modelling and prediction (e.g. multi-layer perceptrons, random forests?).
My data is a conglomeration of borehole data from hundreds of boreholes.
Some measurements were made during the drilling of the boreholes (more
or less continuous 'tool responses': geophysical well-logs), and some in
the laboratory on discrete samples of 10 cm up to metre-length scales.
The data could be considered ordered series to some extent, though
changes in rock types with depth can result in 'step' changes in tool
responses.
My problem is not classifying the rocks, but modelling and predicting a
physical attribute of the rocks: thermal conductivity, a lab measurement
that is expensive and hard to come by. I want to use the more common
well-log responses to predict this attribute.
Some boreholes have different sets of well-log data though. For example,
one might have measurements from the A and B tool, while another might
have A, B, and C tools, and a third the B and C tools. I can construct a
decent database of about 70,000 observations of a common set of 5 tool
responses, with about 100 associated measurements of thermal
conductivity. I am fairly confident that the relationship between the
well-log responses and thermal conductivity is non-linear: linear
regression has not proven accurate.
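Since I am considering random forests, here is the kind of thing I have in
mind, sketched on synthetic data (in Python/scikit-learn purely for
illustration; I understand the R randomForest package works analogously).
The data, the non-linear relationship, and all variable names are made up
stand-ins for my real tool responses and conductivity values:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in: ~100 labelled observations of 5 tool responses
# (mimicking the ~100 conductivity measurements in the real data).
X = rng.normal(size=(100, 5))
# A deliberately non-linear "conductivity" as a function of the tools.
y = 3 * np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=100)

model = RandomForestRegressor(n_estimators=500, oob_score=True,
                              random_state=0)
model.fit(X, y)

# The out-of-bag R^2 gives an honest fit estimate without a held-out
# set, which matters when only ~100 labelled samples are available.
print(round(model.oob_score_, 2))
```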
What 'sort' of problem is this?
Have you seen problems like this, and what did you use to solve it?
I have papers by people using other ANN type techniques (MLP in
particular) to model and predict thermal conductivity, but wondered if
there was something else I could try.
Some other questions I would like a little guidance on:
Are 100 samples of the 'target' attribute enough for confident modelling
and prediction?
How would I quantify the certainty of the modelling results?
The well-log data is extensive, but if I look at the complete set of
tool responses, there is a LOT of missing data (because there is no
common tool set). Is there a way I can still use the less common tool
responses?
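On that last point, one pragmatic option I have read about is imputing
the missing tool responses rather than dropping the rarer tools
entirely. A minimal sketch of what I mean, again in Python/scikit-learn
on invented data (the 30% missingness rate and median strategy are
arbitrary assumptions, not properties of my data):

```python
import numpy as np
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(2)
X = rng.normal(size=(8, 4))
# Simulate tools that were not run in some boreholes.
X[rng.random(X.shape) < 0.3] = np.nan

# Median imputation keeps columns (tools) that are only sometimes
# logged; model-based imputers (e.g. IterativeImputer) are a fancier
# alternative.
imp = SimpleImputer(strategy="median")
X_filled = imp.fit_transform(X)
print(np.isnan(X_filled).any())  # prints False: no missing values remain
```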
Is discretisation of the 100 measured thermal conductivities a silly
idea? If not, how many 'bins' could I reasonably construct?
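For the certainty question, my current understanding is that with only
~100 labelled samples, k-fold (or leave-one-out) cross-validation is the
usual way to get an honest error estimate. A sketch of what I would try,
once more in Python/scikit-learn on synthetic data (fold count and model
settings are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = 3 * np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(scale=0.1, size=100)

model = RandomForestRegressor(n_estimators=300, random_state=0)
# Each fold's RMSE estimates out-of-sample error; the spread across
# folds indicates how stable that estimate is with ~100 samples.
scores = cross_val_score(model, X, y,
                         scoring="neg_root_mean_squared_error",
                         cv=KFold(n_splits=10, shuffle=True, random_state=0))
rmse = -scores
print(f"CV RMSE: {rmse.mean():.2f} +/- {rmse.std():.2f}")
```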
Thanks for reading!
Ben.