[R] Splitting data

Marina de Wolff marinadewolff at hotmail.com
Fri Aug 12 17:19:47 CEST 2011


I understand the confusion; I hope I can clarify this. I don't think the problem you are referring to applies to my data.
 
My data consists of 40,000 pressure measurements at a gas well (pressure vs. time). I have included a picture.
The problem with this data set is that only data points in a 'stable' situation are reliable.
Therefore the data set needs to be filtered before it is usable. I'm trying different ideas to achieve that.

I already tried breakpoints and a kind of steady-state detection based on a moving variance.
Another idea would be to split the data in half and compare the two halves (first half and second half of the data) with each other.
If they differ significantly, split again and compare the two left groups with each other and the two right groups with each other, and so on.
As a final result I would (I hope) have groups of different lengths, and I would only use the groups with a minimum size of m.
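
The splitting idea described above could be sketched roughly as follows for a single numeric series. This is only an illustration under stated assumptions: the function name split_stable, its arguments, and the simulated pressure values are all made up for this example, and t.test() treats the observations as independent, which pressure readings close together in time probably are not, so the p-values are at best a rough guide.

```r
# Hypothetical helper (not from the thread): recursively halve a numeric
# series while the two halves differ significantly (Welch t-test) and
# return the index ranges of the segments where splitting stopped.
# min_size must be at least 2 so that each half has a variance.
split_stable <- function(x, alpha = 0.05, min_size = 3, start = 1) {
  n <- length(x)
  end <- start + n - 1
  # Too short to give two halves of at least min_size points: keep as is
  if (n < 2 * min_size) return(data.frame(start = start, end = end))
  half <- floor(n / 2)
  left <- x[1:half]
  right <- x[(half + 1):n]
  p <- t.test(left, right)$p.value
  # Halves do not differ significantly: treat the stretch as one group
  if (p > alpha) return(data.frame(start = start, end = end))
  # Otherwise recurse into the left half and the right half separately
  rbind(
    split_stable(left, alpha, min_size, start),
    split_stable(right, alpha, min_size, start + half)
  )
}

set.seed(1)
pressure <- c(rnorm(64, 0, 0.1), rnorm(64, 1, 0.1))  # simulated, not real data
groups <- split_stable(pressure, min_size = 4)

# Keep only the groups with at least m points
m <- 16
groups[groups$end - groups$start + 1 >= m, ]
```

Returning index ranges rather than the values themselves makes it easy to map each group back to its position in time.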
 
Many thanks in advance for your assistance with this.
 
Sincerely,
 
Marina de Wolff

 



From: michael.weylandt at gmail.com
Date: Fri, 12 Aug 2011 09:35:00 -0400
Subject: Re: [R] Splitting data
To: marinadewolff at hotmail.com
CC: r-help at r-project.org

Yes, that is likely the source of the difference. I'm happy to help fix it up (it won't be hard), but I want to clarify exactly how you want the data split:

Say we have 20 values, x = 1:20. If there's a split we go to 1:10, 11:20; then 1:5, 6:10, 11:15, 16:20, etc.
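
As a small sketch of that halving scheme, the index ranges at each level can be generated as below. The helper name halve_ranges is made up for this illustration, and it uses floor() for the cut point.

```r
# Hypothetical helper: index ranges after `level` rounds of halving 1:n.
# Each row of the result is one group, given as (first index, last index).
halve_ranges <- function(n, level) {
  ranges <- list(c(1, n))
  for (i in seq_len(level)) {
    ranges <- unlist(lapply(ranges, function(r) {
      if (r[2] - r[1] < 1) return(list(r))  # a single point cannot be split
      mid <- r[1] + floor((r[2] - r[1] + 1) / 2) - 1
      list(c(r[1], mid), c(mid + 1, r[2]))
    }), recursive = FALSE)
  }
  do.call(rbind, ranges)
}

halve_ranges(20, 1)  # rows: 1-10, 11-20
halve_ranges(20, 2)  # rows: 1-5, 6-10, 11-15, 16-20
```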

but what about situations with very different data sets: 

x = cbind(1:20, 1:7)
one split takes us to where exactly: cbind( c(1:10, 11:20), c(1:3,1:4)) or cbind( c(1:10,11:20), c(1:4,5:7)) and then what of the next iteration? 

More generally, what exactly are you comparing? It seems odd to have two different categories/samples and to compare their means and then to switch gears entirely to compare subsamples of the categories independently. It seems that they are just different inferences: comparing the average of cats vs dogs and then comparing boy cats vs girl cats and boy dogs vs girl dogs. That winds up highlighting different independent variables. (Iteration one: species --> iteration two: gender)

If you could speak a little more about your data, it'd be easier to do the splits in a meaningful way. 

As currently implemented, my code takes a 2d data frame and simply divides it into the top and bottom halves, which in most applications would correspond to doing a mean-comparison calculation for different statistics of the same observation. The subsetting then keeps "corresponding" data together -- I put corresponding in quotation marks because we aren't doing paired t-tests. 

Looking forward to your reply,

Michael

PS -- I did the splits basically the same way (other than the direction) but I just used floor() instead of round(). 



On Fri, Aug 12, 2011 at 3:45 AM, Marina de Wolff <marinadewolff at hotmail.com> wrote:



Thank you for your reply,
 
I used this code on my test data, but did not get the same p-values. 
 
I think I know where the difference lies: when the data is split into 4 parts, I want to compare the two left groups (groups 1 and 2) with each other and the two right groups (groups 3 and 4) with each other. It seems that this code compares group 1 with group 3 and group 2 with group 4. I have not yet succeeded in changing this. 
 
About the unequal data sizes, I thought I could 'correct' this by using round. For example, when my data consists of 17 data points I would use
 
m <- length(data)/2
x <- data[1:round(m)]
y <- data[(round(m)+1):length(data)]
 
x has size 8 and y has size 9 (round(8.5) is 8 in R, which rounds halves to the even digit). 
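
A quick check of the split sizes, as a sketch with stand-in data: R's round() rounds halves to the even digit, so round(8.5) is 8, not 9. If the extra point should go into the first half, ceiling() does that explicitly. The variable k is introduced here only for illustration.

```r
data <- seq_len(17)        # stand-in for 17 data points
m <- length(data) / 2      # 8.5

round(m)                   # 8, because halves are rounded to the even digit
ceiling(m)                 # 9

# Put the extra point in the first half explicitly
k <- ceiling(m)
x <- data[1:k]
y <- data[(k + 1):length(data)]
length(x)                  # 9
length(y)                  # 8
```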
 

Sincerely,
Marina de Wolff

 




From: michael.weylandt at gmail.com
Date: Thu, 11 Aug 2011 11:54:11 -0400
Subject: Re: [R] Splitting data
To: marinadewolff at hotmail.com
CC: r-help at r-project.org




This sounds very much like a recursive problem: something like this seems to get the gist of what you want. 

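DataSplits <- function(Data, alpha = 0.05) {
    # Recursive worker; Data is a two-column data frame or matrix
    DataSplitsCore <- function(Data, alpha, level) {
        # Welch t-test comparing column 1 with column 2
        tt <- t.test(Data[,1],Data[,2])
        print(tt)
        if (tt$p.value > alpha) {
            # No significant difference: stop splitting this branch
            print(paste("Stopped at level", level))
            return(invisible(TRUE))
        } else {
            nr = floor(NROW(Data)/2)
            # A half with a single row cannot be tested, so stop here
            # (the bare `stop` in the first version did nothing)
            if (nr <= 1) {
                print("Reached Samples of Size 1")
                return(invisible(TRUE))
            }
            # Recurse into the top half, then the bottom half
            d1 = DataSplitsCore(Data[(1:nr),], alpha = alpha, level = level + 1)
            if (d1) return(invisible(TRUE))
            d2 = DataSplitsCore(Data[-(1:nr),], alpha = alpha, level = level +1)
            if (d2) return(invisible(TRUE))
            return(invisible(FALSE))
        }
    }
    DataSplitsCore(Data, alpha = alpha, level = 1)
}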

Your description wasn't the clearest about what to do when the data sizes didn't match, but this should give you a start. Let me know if this doesn't do as desired and I can help tweak it. 

Hope this can be of help, 

Michael Weylandt 

PS -- You might as well use R's built in t.test function. 
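
As a quick sanity check, sketched here with made-up samples of the same shape as the test data in the original post, the hand-computed Welch statistic, degrees of freedom and p-value agree with what t.test() reports:

```r
set.seed(42)
x <- rnorm(9, 0, 0.1)   # made-up samples, same shape as the test data
y <- rnorm(7, 1, 0.1)

# Welch statistic, Welch-Satterthwaite degrees of freedom, two-sided p-value
vx <- var(x) / length(x)
vy <- var(y) / length(y)
t_stat <- (mean(x) - mean(y)) / sqrt(vx + vy)
df <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))
p <- 2 * pt(-abs(t_stat), df = df)

tt <- t.test(x, y)      # var.equal = FALSE (Welch) is the default
all.equal(p, tt$p.value)
all.equal(df, unname(tt$parameter))
```

Both all.equal() calls should return TRUE, so the manual formulas can be replaced by tt$p.value.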


On Thu, Aug 11, 2011 at 5:17 AM, Marina de Wolff <marinadewolff at hotmail.com> wrote:


I want to implement the following algorithm in R:

I want to split my data and use a t-test to compare the means of the two groups, to see whether they differ significantly from each other. If they do (p < alpha), I want to split again (into 4 groups) and do the same procedure twice; otherwise I want to stop (this is where the problem arises). As a final result I would have different groups of data.

I wrote some code in which the data is split until no further splitting is possible. So for 16 data points, we can split 4 times, with a final result of 16 groups (p is NA for the 4th split since the sd cannot be calculated).

The code calculates all p-values, but I don't want this. I want it to stop when p > alpha. I tried a while loop, but didn't succeed.

I hope someone can help me to achieve my goal.

This is what I tried so far with test data:

a = rnorm(9,0,0.1)
b = rnorm(7,1,0.1)
data = c(a,b)
plot(data)

# Want to calculate max of groups/split for the data
d = seq(1,100,1)
n = 2^d
m <- which(n <=length(data))
n = n[m[1]:m[length(m)]]

# All groups
i=0
j=0
dx = 0
dy = 0
for (i in 1:length(n)){
split <- length(data)/(n[i])
for (j in 1:(n[i]/2)){
x = data[(1 + (j-1)*(2*split)):(round(split) + (j-1)*(2*split))]
dx = cbind(dx,x)
y = data[((round(split)+1) + (j-1)*(2*split)):(2*j*split)]
dy = cbind(dy,y)
}}

dx = dx[,2:dim(dx)[2]]
dy = dy[,2:dim(dy)[2]]

k=0
meanx=0
meany=0
sdx=0
sdy=0
nx=0
ny=0
for (k in 1:dim(dx)[2]) {
meanx[k] = mean(unique(dx[,k]))
meany[k] = mean(unique(dy[,k]))
sdx[k] = sd(unique(dx[,k]))
sdy[k] = sd(unique(dy[,k]))
nx[k] = length(unique(dx[,k]))
ny[k] = length(unique(dy[,k]))
}

t = (meanx-meany)/sqrt((sdx^2/nx) + (sdy^2/ny))
df = ((sdx^2/nx) + (sdy^2/ny))^2/((sdx^2/nx)^2/(nx-1) + (sdy^2/ny)^2/(ny-1))
p = 2*pt(-abs(t),df=df)
alpha = 0.05

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


 		 	   		  
-------------- next part --------------
A non-text attachment was scrubbed...
Name: data.pdf
Type: application/pdf
Size: 2624333 bytes
Desc: not available
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20110812/e1eeee27/attachment-0001.pdf>

