[R] caret() train based on cross validation - split dataset to keep sites together?

Max Kuhn mxkuhn at gmail.com
Wed May 30 18:40:02 CEST 2012


Tyrell,

If you want each fold to contain data from only one site at a time, you
can build a list of row indices and pass it to the index argument of
trainControl(). For example:

   index = list(site1 = c(1, 6, 8, 12),
                site2 = c(120, 152, 176, 178),
                site3 = c(754, 789, 981))

The first resample would fit a model to the site 1 rows listed in the
first list element and predict everything else, and so on.
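
If you want one list element per site built automatically, an untested
sketch (assuming your training data frame is temp.train with the site
column Comid, as in your example below) would be:

   library(caret)

   ## one vector of row numbers per site; each resample then fits on a
   ## single site's rows and predicts all of the remaining rows
   siteIndex <- split(seq_len(nrow(temp.train)), temp.train$Comid)

   siteControl <- trainControl(method = "cv", index = siteIndex)

With ~1,000 sites that is ~1,000 resamples, so it will be slow. If you
would rather fit on all but one site and hold that site out, replace each
element with its complement via setdiff(seq_len(nrow(temp.train)), x).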

I'm not sure if this is what you need, but there you go.
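
Also, for the 80/10/10 split by site that you describe below, here is a
rough, untested sketch (the data frame name tempdata is just a
placeholder for your full data set; the site column is Comid as in your
example):

   set.seed(1)
   sites    <- sample(unique(tempdata$Comid))  # tempdata = your full data (placeholder name)
   n        <- length(sites)
   trainIds <- sites[1:floor(0.8 * n)]
   testIds  <- sites[(floor(0.8 * n) + 1):floor(0.9 * n)]
   validIds <- sites[(floor(0.9 * n) + 1):n]

   temp.train <- tempdata[tempdata$Comid %in% trainIds, ]
   temp.test  <- tempdata[tempdata$Comid %in% testIds,  ]
   temp.valid <- tempdata[tempdata$Comid %in% validIds, ]

   ## 10-fold CV on the training data with whole sites held out together:
   ## assign sites to folds, then pass train() the rows used for fitting
   ## in each resample; the held-out rows default to the complement.
   siteFolds <- split(sample(trainIds), rep(1:10, length.out = length(trainIds)))
   cvIndex   <- lapply(siteFolds, function(heldOut)
                       which(!(temp.train$Comid %in% heldOut)))

   fitControl <- trainControl(method = "cv", number = 10, index = cvIndex)

This keeps every observation from a site in the same partition, but it
does not balance the folds on the predictor distributions (your point 3);
that would take extra work, e.g. stratifying the site assignment on
site-level summaries of the predictors.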

Max

On Wed, May 30, 2012 at 7:55 AM, Tyrell Deweber <jtdeweber at gmail.com> wrote:
> Hello all,
>
> I have searched but have not yet found a solution, so I am sending this
> message. In short, I need to split my data into training, validation, and
> testing subsets that keep all observations from the same site together,
> preferably as part of a cross-validation procedure. Now for the longer
> version; I should confess that although my R skills are improving, they are
> not yet highly developed.
>
> I am using 10-fold cross-validation with 3 repeats in the train function of
> the caret package to identify an optimal nnet (neural network) model for
> predicting daily river water temperature at unsampled sites. I am also
> withholding data from 10% of sites to get a better estimate of
> generalization error. However, as far as I can see, evaluating predictions
> at entirely new sites is not easily accommodated. My data structure (example
> at the bottom of this email) consists of columns identifying the site, the
> date, the water temperature on that day at the site (the response variable),
> and many predictors. There are over 220,000 individual observations at
> ~1,000 sites, and each site has a minimum of 30 observations. It is
> important to keep sites separate because selecting a model based on
> predictions at an already-sampled site is likely to be overly optimistic.
>
> Is there a way to split the data for (or preferably during) the
> cross-validation procedure that:
>
> 1.) Selects a separate validation dataset from 10% of sites
> 2.) Splits the remaining training data into cross-validation subsets while,
> most importantly, keeping all observations from a site together
> 3.) Secondarily, constrains the partitions to be similar, ideally based on
> the distributions of all variables
>
> It seems that some combination of the sample.split function from the
> caTools package and the createDataPartition function from caret might do
> this, but I am at a loss for how to code that.
>
> If this is not possible, I would be content to skip the cross-validation
> procedure and create three similar splits of my data that keep all
> observations from a site together: one for training, one for testing, and
> one for validation. The alternative goal would be to split the data so that
> 80% of sites are used for training, 10% for testing (model selection), and
> 10% for validation.
>
> Thank you, and please let me know if there are any remaining questions.
> This is my first post, so please also let me know if I left anything out.
>
> Tyrell Deweber
>
>
>
> R version 2.13.1 (2011-07-08)
> Copyright (C) 2011 The R Foundation for Statistical Computing
> ISBN 3-900051-07-0
> Platform: x86_64-redhat-linux-gnu (64-bit)
>
> Comid   tempymd      watmntemp   airtemp   predictorb …
> 15433   1980-05-01   11.4        22.1      …
> 15433   1980-05-02   11.6        23.6      …
> 15433   1980-05-03   11.2        28.5      …
> 15687   1980-06-01   13.5        26.5      …
> 15687   1980-06-02   14.2        26.9      …
> 15687   1980-06-03   13.8        28.9      …
> 18994   1980-04-05   8.4         16.4      …
> 18994   1980-04-06   8.3         12.6      …
> 90342   1980-07-13   18.9        22.3      …
> 90342   1980-07-14   19.3        28.4      …
>
> EXAMPLE SCRIPT FOR MODEL FITTING
>
>
> library(caret)
>
> fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
>
> tuning <- read.table("temptunegrid.txt", header = TRUE, sep = ",")
> tuning
>
>
> # Model with 100 iterations, fit in parallel on 4 cores
> library(doMC)
> registerDoMC(4)
>
> tempmod100its <- train(
>     watmntemp ~ tempa + tempb + tempc + tempd + tempe +
>         netarea + netbuffor + strmslope + netsoilprm + netslope +
>         gwndx + mnaspect + urb + ag + forest + buffor +
>         tempa7day + tempb7day + tempc7day + tempd7day + tempe7day +
>         tempa30day + tempb30day + tempc30day + tempd30day + tempe30day,
>     data = temp.train, method = "nnet", linout = TRUE, maxit = 100,
>     MaxNWts = 100000, metric = "RMSE", trControl = fitControl,
>     tuneGrid = tuning, trace = TRUE)
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



-- 

Max


