[R] Subsampling-oversampling from a data frame
B77S
bps0002 at auburn.edu
Wed Nov 2 05:53:07 CET 2011
# Perhaps I misunderstand your original need, but....
## I added a few lines to your data and used dput() to get the data below
## (which I named "df")
df<- structure(list(age = c(15L, 20L, 15L, 10L, 10L, 12L, 17L, 17L,
11L, 12L, 16L, 20L, 23L, 14L, 22L, 16L, 10L, 11L, 21L, 10L, 13L,
17L), sex = structure(c(2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 1L,
2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 1L), .Label = c("f",
"m"), class = "factor"), class = structure(c(2L, 1L, 2L, 2L,
2L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 1L,
2L, 1L), .Label = c("high", "low"), class = "factor")), .Names = c("age",
"sex", "class"), class = "data.frame", row.names = c(NA, -22L
))
## The following line uses which(), sample(), and rbind(), along with some
## indexing, to get a new data frame; see ?which, ?sample, and ?rbind for
## more info.
## For the "indexing", play with it; type in df[1:3, 1:2] as an example.
new_df <- rbind(df[sample(which(df$class=="low"), 4),],
df[sample(which(df$class=="high"), 4),])
Now replace 4 with the number of rows you want from each class.
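For the 50/50 split asked about in the original question, one option is to keep every "low" row and draw the same number of "high" rows. Below is a minimal sketch under that assumption; the data frame `toy` and the names `low_rows`, `high_rows`, and `balanced` are made up for illustration and are not from the original data. It indexes through sample.int() rather than calling sample() directly on the result of which(), because sample(x, n) treats a single number x as the range 1:x, which can bite when a class has only one row.

```r
set.seed(42)  # only so the illustration is reproducible

# Hypothetical toy data with a 30/70 low/high split, as described in the question
toy <- data.frame(
  age   = sample(10:23, 20, replace = TRUE),
  sex   = factor(sample(c("f", "m"), 20, replace = TRUE)),
  class = factor(rep(c("low", "high"), times = c(6, 14)))
)

low_rows  <- which(toy$class == "low")   # row numbers of the minority class
high_rows <- which(toy$class == "high")  # row numbers of the majority class

# Keep every "low" row and draw the same number of "high" rows,
# without replacement, so both classes end up the same size
n_low <- length(low_rows)
balanced <- rbind(
  toy[low_rows, ],
  toy[high_rows[sample.int(length(high_rows), n_low)], ]
)

table(balanced$class)  # both classes should now have the same count
```

To oversample "low" instead of subsampling "high", the same pattern should work with the roles swapped and replace = TRUE, e.g. low_rows[sample.int(length(low_rows), length(high_rows), replace = TRUE)].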
hgwelec wrote:
>
> Thank you for your answer.
>
> The problem is that I am learning R now, so I do not know how I could do
> this.
>
>
> I have found the following code, but unfortunately it does not work
> (it should create a distribution of 0.1 "low" class and 0.9 "high"):
>
>
>
> data[c(rownames(data.df[data.df$class=="high",]),
> sample(rownames(data[data.df$class=="low"]), 0.1)) , ]
>
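A note on the quoted line: it mixes two object names (data and data.df), the second subset is missing the comma before the closing bracket (so it selects columns, not rows), and the size argument of sample() is a number of items to draw, not a proportion. Here is a minimal sketch of what that line appears to be aiming for, assuming the goal is to keep all "high" rows and add just enough "low" rows to make up about 10% of the result. The names `dat`, `target_low`, and `skewed` are hypothetical.

```r
set.seed(1)  # only so the illustration is reproducible

# Hypothetical toy data; `dat` is a stand-in, not the poster's data
dat <- data.frame(
  age   = sample(10:23, 100, replace = TRUE),
  class = factor(rep(c("low", "high"), times = c(30, 70)))
)

target_low <- 0.1  # desired share of "low" rows (assumed reading of the quoted line)

high_rows <- which(dat$class == "high")
low_rows  <- which(dat$class == "low")

# Keep all "high" rows; solve n_low / (n_low + n_high) = target_low for n_low
n_low <- round(length(high_rows) * target_low / (1 - target_low))
n_low <- min(n_low, length(low_rows))  # cannot draw more than exist without replacement

keep   <- c(high_rows, low_rows[sample.int(length(low_rows), n_low)])
skewed <- dat[keep, ]

prop.table(table(skewed$class))  # observed class shares in the result
```

Because row counts are whole numbers, the resulting share will only be approximately equal to the target.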
Dear members,
Consider the following data frame (first 4 rows shown)
age sex class
15 m low
20 f high
15 f low
10 m low
In my original data set I have 1200 rows and a class distribution of low=0.3
and high=0.7.
My question: how can I create a new data frame like the one shown above but
with the 'high' class subsampled so that in the new data frame the class
distribution is low=0.5 and high=0.5?
I tried looking at the sample function and its prob option, but none of the
examples I have seen deal with an imbalanced class problem like the one shown above.
Thank you in advance
--
View this message in context: http://r.789695.n4.nabble.com/Subsampling-oversampling-from-a-data-frame-tp3965771p3971840.html
Sent from the R help mailing list archive at Nabble.com.