[R] Chi square test on data frame

Bansal, Vikas vikas.bansal at kcl.ac.uk
Wed Aug 17 21:07:43 CEST 2011


Dear Michael,

Thanks a lot for your reply and for your help. I was struggling a great deal, but your suggestion pointed me toward a solution. I tried your code on my data frame step by step and it looks fine to me. But when I ran the chi-square test:

res=chisq.test(y1[id],p=y2[id],rescale.p=T)

        Chi-squared test for given probabilities

data:  y1[id] 
X-squared = NaN, df = 19997, p-value = NA

Warning message:
In chisq.test(y1[id], p = y2[id], rescale.p = T) :
  Chi-squared approximation may be incorrect

It does not give a p-value. I then checked the observed and expected values, and the call pools all the numbers into a single test. As I mentioned earlier, I want a p-value for each row, so the degrees of freedom will be 1. Example:

I have a data frame with 8 columns-
     V1   V2   V3   V4   W1   W2   W3   W4
1     0   84   22   10    0   84    0    0
2    35   84    0    0   22   84    0    0
3     0    0    0   48    0    0    0   48
4     0   48    0    0    0   48    0    0
5     0   84    0    0    0   84    0    0
6     0    0    0   48    0    0    0   48

Example for the first row:

The two largest values are 84 (in V2) and 22 (in V3), so these are the observed values. Since the largest values are in V2 and V3, the expected values must be taken from W2 and W3, which are 84 and 0. I know that values should not be 0 for a chi-square test, but we will ignore the warning.

It should then compute a p-value for the next row, taking 35 and 84 (V1 and V2) as observed and 22 and 84 (W1 and W2) as expected. So here it will run the chi-square test for all 6 rows and produce 6 p-values. My real data frame has many rows (approx. 9999).

Can you please help me with this?



Thanking you,
Warm Regards
Vikas Bansal
Msc Bioinformatics
Kings College London
________________________________________
From: R. Michael Weylandt [michael.weylandt at gmail.com]
Sent: Wednesday, August 17, 2011 7:11 PM
To: Bansal, Vikas
Cc: r-help at r-project.org
Subject: Re: [R] Chi square test on data frame

I think everything below is right, but it's all a little helter-skelter so take it with a grain of salt:

First things first: when posting to the list, share your data with dput().

Y = structure(c(0, 35, 0, 0, 0, 0, 84, 84, 0, 48, 84, 0, 22, 0, 0,
0, 0, 0, 10, 0, 48, 0, 0, 48, 0, 22, 0, 0, 0, 0, 84, 84, 0, 48,
84, 0, 0, 0, 0, 0, 0, 0, 0, 0, 48, 0, 0, 48), .Dim = c(6L, 8L
), .Dimnames = list(c("1", "2", "3", "4", "5", "6"), c("V1",
"V2", "V3", "V4", "W1", "W2", "W3", "W4")))

Now,

Y1 = Y[,1:4]
Y2 = Y[,-(1:4)]

id = apply(Y1,1,order,decreasing=TRUE)[1:2,]
# This has the columns you want for each row, but it's not directly appropriate for subsetting
# Specifically, the problem is that the row information is implicit in where the col index is in id
# We extract it explicitly and build a 2-column matrix that gives the row and column of each data point
id = cbind(as.vector(col(id)),as.vector(id))
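
To see what that two-column index does, here is a minimal sketch on a small made-up matrix `m` (not the data from this thread; the names `m`, `top2` and `idx` are only for illustration). Indexing a matrix with a two-column numeric matrix picks one element per (row, column) pair:

```r
# Small made-up matrix, used only to illustrate the indexing idiom.
m <- matrix(c(5, 1, 9,
              2, 8, 3), nrow = 2, byrow = TRUE)

# order() applied to each row; apply() returns one COLUMN per row of m,
# so after [1:2, ] each column holds the column indices of that row's top two.
top2 <- apply(m, 1, order, decreasing = TRUE)[1:2, ]

# col(top2) recovers which row of m each entry belongs to.
idx <- cbind(as.vector(col(top2)), as.vector(top2))

# One element per (row, column) pair: top two of row 1, then top two of row 2.
m[idx]
```

The result is a plain vector ordered row by row, which is why the row structure has to be rebuilt if you want one test per row.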

Now you can take

Y1[id] as the observed values and Y2[id] as the expected.
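
If one p-value per row is wanted rather than a single pooled test, a rough sketch along the following lines might do. The helper `row_pvalue` is hypothetical (it is not part of the code above), and returning NA when the observed or expected pair sums to zero is my own assumption about how to treat rows where the test is undefined:

```r
# Example data from the thread: observed counts in V1..V4, expected in W1..W4.
Y <- matrix(c( 0, 84, 22, 10,  0, 84, 0,  0,
              35, 84,  0,  0, 22, 84, 0,  0,
               0,  0,  0, 48,  0,  0, 0, 48,
               0, 48,  0,  0,  0, 48, 0,  0,
               0, 84,  0,  0,  0, 84, 0,  0,
               0,  0,  0, 48,  0,  0, 0, 48),
            nrow = 6, byrow = TRUE,
            dimnames = list(as.character(1:6),
                            c("V1", "V2", "V3", "V4", "W1", "W2", "W3", "W4")))

# Hypothetical helper: test the two largest observed values of one row
# against the expected values in the matching columns (df = 1).
row_pvalue <- function(obs_row, exp_row) {
  top2 <- order(obs_row, decreasing = TRUE)[1:2]
  obs  <- unname(obs_row[top2])
  expd <- unname(exp_row[top2])
  # Assumption: report NA when the test cannot be set up at all.
  if (sum(obs) == 0 || sum(expd) == 0) return(NA_real_)
  # Warnings about small expected counts are silenced here on purpose.
  suppressWarnings(chisq.test(obs, p = expd, rescale.p = TRUE)$p.value)
}

pvals <- vapply(seq_len(nrow(Y)),
                function(i) row_pvalue(Y[i, 1:4], Y[i, 5:8]),
                numeric(1))
names(pvals) <- rownames(Y)
pvals
```

Rows whose expected pair contains a zero will still give degenerate results, so the output should be read with that in mind.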

But, to be honest, it sounds like you have bigger problems with the use of a chi-squared test itself than with anything else. Beyond all the zeros, you should note that in your example the observed count is never smaller than the expected count, because Y1 >= Y2 everywhere. I'll leave that up to you though.
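
To make the zero problem concrete, here is a small sketch that computes the textbook Pearson statistic by hand, which as far as I know is what chisq.test() evaluates when `p` is supplied. The function name `pearson_stat` is made up for this illustration:

```r
# Pearson statistic for observed counts against expected proportions,
# with the expected values rescaled to the observed total.
pearson_stat <- function(obs, expd) {
  E <- sum(obs) * expd / sum(expd)
  sum((obs - E)^2 / E)
}

pearson_stat(c(84, 22), c(84, 0))   # row 1: a positive count over expected 0
pearson_stat(c(48, 0),  c(48, 0))   # row 3: a 0/0 term
pearson_stat(c(84, 35), c(84, 22))  # row 2: no zeros, finite statistic
```

A positive observed count over an expected count of zero gives Inf, and a zero over zero gives NaN. A single NaN term is enough to turn a pooled sum over all rows into NaN.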

Hope this helps and please make sure you can take my code apart piece by piece to understand it: there's some odd data manipulation that takes advantage of R's way of coercing matrices to vectors and if your actual data isn't like the provided example, you may have to modify.

Michael Weylandt

On Wed, Aug 17, 2011 at 10:26 AM, Bansal, Vikas <vikas.bansal at kcl.ac.uk> wrote:
Is there anyone who can help me with a chi-square test on a data frame? I have been struggling for the last 2 days. I will be very thankful to you.

Dear all,

I have been working on this problem for many hours but have not found a solution.
I have a data frame with 8 columns-
     V1   V2   V3   V4   W1   W2   W3   W4
1     0   84   22   10    0   84    0    0
2    35   84    0    0   22   84    0    0
3     0    0    0   48    0    0    0   48
4     0   48    0    0    0   48    0    0
5     0   84    0    0    0   84    0    0
6     0    0    0   48    0    0    0   48

From the first four columns, for each row I have to take the two largest values, and these two values will be considered the observed values. The expected values come from the last four columns. So I have to perform a chi-square test for each row to get p-values.

Example for the first row:

The two largest values are 84 (in V2) and 22 (in V3), so these are the observed values. Since the largest values are in V2 and V3, the expected values must be taken from W2 and W3, which are 84 and 0. I know that values should not be 0 for a chi-square test, but we will ignore the warning.
Now that we have both observed and expected values, we have to perform the chi-square test to get a p-value for each row in a new column.


So far I have been trying to return the indices of the two largest values with
sort.int(df,index.return=TRUE)$ix[c(4,3)]
but it does not accept a data frame.

Can you please give me some idea of how to do this? It is very tricky, and after studying it a lot I am still not able to do it. Please help.



Thanking you,
Warm Regards
Vikas Bansal
Msc Bioinformatics
Kings College London
______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
