[R] Subsampling-oversampling from a data frame

B77S bps0002 at auburn.edu
Wed Nov 2 05:53:07 CET 2011


# Perhaps I misunderstand your original need, but....


## I added a few lines to your data and used dput() to get the data below
## (which I named "df")

df<- structure(list(age = c(15L, 20L, 15L, 10L, 10L, 12L, 17L, 17L, 
11L, 12L, 16L, 20L, 23L, 14L, 22L, 16L, 10L, 11L, 21L, 10L, 13L, 
17L), sex = structure(c(2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 
2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 1L), .Label = c("f", 
"m"), class = "factor"), class = structure(c(2L, 1L, 2L, 2L, 
2L, 1L, 1L, 2L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 2L, 2L, 1L, 1L, 
2L, 1L), .Label = c("high", "low"), class = "factor")), .Names = c("age", 
"sex", "class"), class = "data.frame", row.names = c(NA, -22L
))

## The following line uses which(), sample(), and rbind(), along with some
## indexing, to get a new data frame; see ?which, ?sample, and ?rbind for
## more info.
# For the "indexing", play with it; type in df[1:3,1:2] as an example.

new_df <- rbind(df[sample(which(df$class=="low"), 4),],
df[sample(which(df$class=="high"), 4),])

Now replace 4 with the number of rows you want from each class.
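
If the goal is the 50/50 split from the original question rather than a fixed count, the same idea can be wrapped up so that the count is taken from the smaller class. This is only a rough sketch: the toy data and the helper name downsample_to_smallest() are made up for illustration and are not part of base R or of the original data.

```r
# Toy data with roughly the 30/70 imbalance described in the question
# (the values are invented for illustration only).
set.seed(1)
toy <- data.frame(
  age   = sample(10:23, 20, replace = TRUE),
  class = factor(rep(c("low", "high"), times = c(6, 14)))
)

# Hypothetical helper: keep every row of the smallest class and draw the
# same number of rows, without replacement, from each larger class.
downsample_to_smallest <- function(d, col = "class") {
  # Row indices grouped by class; factor() drops any unused levels.
  idx_by_class <- split(seq_len(nrow(d)), factor(d[[col]]))
  n_min <- min(lengths(idx_by_class))
  # Sample positions within each group rather than the indices themselves,
  # which avoids sample()'s special handling of a single number.
  keep <- unlist(
    lapply(idx_by_class, function(i) i[sample.int(length(i), n_min)]),
    use.names = FALSE
  )
  d[keep, , drop = FALSE]
}

balanced <- downsample_to_smallest(toy)
table(balanced$class)
```

To oversample the smaller class instead, one could draw with replace = TRUE up to the size of the larger class; see ?sample.int.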




hgwelec wrote:
> 
> Thank you for your answer.
> 
> The problem is that I am learning R now, so I do not know how I could do
> this.
> 
> 
> I have found the following code, but unfortunately it does not work
> (it is meant to create a distribution of 0.1 "low" class and 0.9 "high"):
> 
> 
> 
> data[c(rownames(data.df[data.df$class=="high",]),
> sample(rownames(data[data.df$class=="low"]), 0.1)) , ]
> 





Dear members, 

Consider the following data frame (first 4 rows shown) 


  age sex class 
  15   m   low 
  20   f  high 
  15   f   low 
  10   m   low 

In my original data set I have 1200 rows and a class distribution of low=0.3
and high=0.7.


My question: how can I create a new data frame like the one shown above, but
with the 'high' class subsampled so that in the new data frame the class
distribution is low=0.5 and high=0.5?

I tried looking at the sample function and its prob option, but none of the
examples I have seen deal with an imbalanced class problem like the one shown
above.


Thank you in advance 





--
View this message in context: http://r.789695.n4.nabble.com/Subsampling-oversampling-from-a-data-frame-tp3965771p3971840.html
Sent from the R help mailing list archive at Nabble.com.


