[R] Subsetting data for split-sample validation, then repeating 1000x

Fri Aug 22 23:18:47 CEST 2014

Combine your code into a function:

Plant <- function() {
    train <- sample.int(nrow(A), floor(nrow(A)*.7))
    test <- (1:nrow(A))[-train]
    A.model <- glmmadmb(nat.r ~ isl.sz + nr.mead, random = ~ 1 | site, family =
        "poisson", data = A[train,])
    cor(A[test,]$nat.r, predict(A.model, newdata = A[test,], type = "response"))
}

Test the function. It should return a single correlation and no errors or warnings.

Plant()

If not, debug and run it again. When it works:

Out <- replicate(1000, Plant())

Out should be a vector with 1000 correlation values.
mean(Out) # the average r over the 1000 replications
hist(Out) # for a histogram of the correlation values
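
For anyone reading along without the original data, here is a minimal self-contained sketch of the same pattern. Everything in it is an assumption for illustration: the data frame A is simulated, and base R's glm() with a Poisson family stands in for glmmadmb(), so there is no random site effect. Only the split / fit / predict / replicate structure carries over.

```r
set.seed(1)  # reproducible illustration only

# Simulated stand-in for the real data (hypothetical values)
n <- 200
A <- data.frame(isl.sz = runif(n, 1, 10), nr.mead = runif(n, 0, 5))
A$nat.r <- rpois(n, lambda = exp(0.5 + 0.15 * A$isl.sz + 0.10 * A$nr.mead))

# One split-sample validation run; glm() replaces glmmadmb() here
Plant <- function() {
    train <- sample.int(nrow(A), floor(nrow(A) * 0.7))
    test <- setdiff(seq_len(nrow(A)), train)
    fit <- glm(nat.r ~ isl.sz + nr.mead, family = poisson, data = A[train, ])
    cor(A$nat.r[test], predict(fit, newdata = A[test, ], type = "response"))
}

Out <- replicate(1000, Plant())
mean(Out)      # average r over the 1000 splits
mean(Out^2)    # average r^2 over the 1000 splits

# The same thing with a for loop, preallocating the result vector
Out2 <- numeric(1000)
for (i in seq_along(Out2)) Out2[i] <- Plant()
mean(Out2)
```

replicate() and the for loop give the same kind of result; replicate() is just shorter because it collects the return values for you.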

David C

From: Angela Boag [mailto:Angela.Boag at Colorado.EDU] 
Sent: Friday, August 22, 2014 4:01 PM
To: David L Carlson
Subject: Re: [R] Subsetting data for split-sample validation, then repeating 1000x

Hi David,
Thanks for the feedback. I actually sampled without replacement initially. It had been a while since I looked at this code, and I changed it because I thought sampling with replacement made more sense logically, but you've reassured me that my original hunch was right.
The real issue I'm having is how to use either the replicate() or for(i in 1:1000){} loop code to get the average r value of 1000 repetitions as my output. I'm not familiar with either tool, so any suggestions on what that code would look like would be very helpful.

Thanks!
Angela 

--
Angela E. Boag
Ph.D. Student, Environmental Studies
CAFOR Project Researcher
University of Colorado, Boulder
Mobile: 720-212-6505

On Fri, Aug 22, 2014 at 2:46 PM, David L Carlson <dcarlson at tamu.edu> wrote:
You can use replicate() or a for (i in 1:1000){} loop to do your replications, but you have other issues first.

1. You are sampling with replacement, which makes no sense here. Your 70% sample will contain some observations multiple times, so it will almost always cover less than 70% of the rows (about 50% of them on average), and the rows left over for testing will be more than 30%.

2. You compute r using cor() and r.squared using summary.lm(). Why? Once you have computed r, r*r or r^2 is equal to r.squared for the simple linear model you are using.
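
Both points can be checked on toy data. This is a minimal sketch: n, x and y are made-up values for illustration, not the plant data.

```r
set.seed(42)  # reproducible illustration only
n <- 100

# Point 1: a 70% draw with replacement repeats rows
with.repl <- sample(1:n, floor(0.7 * n), replace = TRUE)
without.repl <- sample.int(n, floor(0.7 * n))
length(unique(with.repl))     # almost surely fewer than 70 distinct rows
length(unique(without.repl))  # exactly 70 distinct rows
length((1:n)[-with.repl])     # so the hold-out set is larger than 30

# Point 2: for a one-predictor lm(), r.squared is just r^2
x <- rnorm(n)
y <- 2 * x + rnorm(n)
r <- cor(x, y)
all.equal(r^2, summary(lm(y ~ x))$r.squared)
```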

# To split your data, you need to sample without replacement, e.g.

train <- sample.int(nrow(A), floor(nrow(A)*.7))
test <- (1:nrow(A))[-train]

# Now run your analysis on A[train,] and test it on A[test,]

# Fit model (I'm modeling native plant richness, 'nat.r')
A.model <- glmmadmb(nat.r ~ isl.sz + nr.mead, random = ~ 1 | site, family =
"poisson", data = A[train,])

# Correlation between predicted 30% and actual 30%
r <- cor(A[test,]$nat.r, predict(A.model, newdata = A[test,], type = "response"))
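
As a sanity check on the split itself, the sketch below confirms that train and test partition the row numbers. The row count n is a hypothetical stand-in for nrow(A).

```r
set.seed(7)  # reproducible illustration only
n <- 53      # hypothetical number of rows, standing in for nrow(A)

train <- sample.int(n, floor(n * 0.7))
test <- (1:n)[-train]

length(train)                         # floor(0.7 * 53) = 37
length(test)                          # the remaining 16
length(intersect(train, test))        # 0: no row is in both sets
identical(sort(c(train, test)), 1:n)  # TRUE: every row is used once

# setdiff() gives the same hold-out set and also behaves sensibly
# if train were ever empty, where (1:n)[-train] would return nothing
identical(test, setdiff(1:n, train))
```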

-------------------------------------
David L Carlson
Department of Anthropology
Texas A&M University
College Station, TX 77840-4352

-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Angela Boag
Sent: Thursday, August 21, 2014 4:46 PM
To: r-help at r-project.org
Subject: [R] Subsetting data for split-sample validation, then repeating 1000x

Hi all,

I'm doing some within-dataset model validation and would like to subset a
dataset 70/30 and fit a model to 70% of the data (the training data), then
validate it by predicting the remaining 30% (the testing data), and I would
like to do this split-sample validation 1000 times and average the
correlation coefficient and r2 between the training and testing data.

I have the following working for a single iteration, and would like to know
how to use either the replicate() or for-loop functions to average the 1000
'r2' and 'cor' outputs.

--

# create 70% training sample
A.samp <- sample(1:nrow(A),floor(0.7*nrow(A)), replace = TRUE)

# Fit model (I'm modeling native plant richness, 'nat.r')
A.model <- glmmadmb(nat.r ~ isl.sz + nr.mead, random = ~ 1 | site, family =
"poisson", data = A[A.samp,])

# Use the model to predict the remaining 30% of the data
A.pred <- predict(A.model, newdata = A[-A.samp,], type = "response")

# Correlation between predicted 30% and actual 30%
cor <- cor(A[-A.samp,]$nat.r, A.pred, method = "pearson")

# r2 between predicted and observed
lm.A <- lm(A.pred ~ A[-A.samp,]$nat.r)
r2 <- summary(lm.A)$r.squared

# print values
r2
cor

--

Thanks for your time!

Cheers,
Angela

--
Angela E. Boag
Ph.D. Student, Environmental Studies
CAFOR Project Researcher
University of Colorado, Boulder
Mobile: 720-212-6505
