[R] Urgent Help needed
Marc Schwartz
marc_schwartz at comcast.net
Thu Aug 16 21:02:32 CEST 2007
On Thu, 2007-08-16 at 12:33 -0400, AbouEl-Makarim Aboueissa wrote:
> Dear All:
>
> Urgent help is needed.
>
>
> I have a data set in matrix format of three columns: X, Y and index
> of four groups (1,2,3,4). What I need to do is the following:
>
> 1- How can I subtract the sample mean of each group indexed 1,2,3,4
> from the corresponding data values of that group, and create new
> columns, say X-sample mean and Y-sample mean? I tried to use "tapply"
> but had some difficulty restoring the new data.
>
>
> 2- How can I use "tapply", if possible, or any other R function to
> find the correlation coefficient between the X and Y columns for each
> group indexed 1,2,3,4? I could not get "tapply" to work.
>
>
> I have attached part of the data as a txt file.
>
>
> Thank you so much for your attention to this matter, and I look
> forward to hearing from you soon.
>
> Regards,
>
> Abou
>
>
> Data:
> ====
> x y index
> 15807.24 12.5 4
> 15752.51 33.5 4
> 12893.76 01.5 3
> 8426.88 22.2 3
> 5706.24 333 3
> 3982.08 560 2
> 3642.62 670 2
> 295.68 124 1
> 215.40 104 1
> 195.40 204 1
> 4240.21 22.4 2
> 1222.72 45.9 2
> 1142.26 23.6 2
> 63.00 90.1 1
> 1216.00 82.4 2
> 2769.60 111 2
> 1790.46 34.7 2
> 26.10 26.10 1
> 19676.83 0.99 4
> 10920.60 203 3
> 6144.00 46 3
> 4534.48 4534.48 3
> 40000.00 65 4
> 29500.00 56 4
> 17100.00 77 4
> 9000.00 435 3
> 6300.00 84 3
> 3962.88 334 2
> 5690.00 653 3
> 3736.00 233 2
> 2750.00 22 2
> 1316.00 345 2
> 4595.00 4595.00 3
> 5928.00 45 3
> 2645.70 0.00 2
> 2580.24 454 2
> 6547.34 6547.34 3
> 1615.68 5 2
> 194.06 55 1
> 184.80 6 1
> 82.94 44 1
> 16649.00 56 4
> 4500.00 74 3
> 1600.00 744 2
>
> =================
I might be tempted to take the following approach:
If your data is a matrix, coerce it to a data frame first. Let's call
that 'DF'.
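For example, something along these lines (here 'mat' is just a placeholder
name for your matrix, and 'data.txt' for the attached file):

DF <- as.data.frame(mat)

## or read the posted data directly from the text file:
DF <- read.table("data.txt", header = TRUE)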
> str(DF)
'data.frame': 44 obs. of 3 variables:
$ x : num 15807 15753 12894 8427 5706 ...
$ y : num 12.5 33.5 1.5 22.2 333 560 670 124 104 204 ...
$ index: int 4 4 3 3 3 2 2 1 1 1 ...
Now use split() to break up the data frame into a list of 4
sub-dataframes, based upon the index value. We can use scale() within a
lapply() loop to center the 'x' and 'y' columns for each sub-dataframe:
DF.ctr <- lapply(split(DF[, -3], DF$index), scale, scale = FALSE)
> str(DF.ctr)
List of 4
$ 1: num [1:8, 1:2] 138.5 58.2 38.2 -94.2 -131.1 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:8] "8" "9" "10" "14" ...
.. ..$ : chr [1:2] "x" "y"
..- attr(*, "scaled:center")= Named num [1:2] 157.2 81.7
.. ..- attr(*, "names")= chr [1:2] "x" "y"
$ 2: num [1:16, 1:2] 1469 1129 1727 -1291 -1371 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:16] "6" "7" "11" "12" ...
.. ..$ : chr [1:2] "x" "y"
..- attr(*, "scaled:center")= Named num [1:2] 2513 230
.. ..- attr(*, "names")= chr [1:2] "x" "y"
$ 3: num [1:13, 1:2] 5879 1413 -1308 3906 -870 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:13] "3" "4" "5" "20" ...
.. ..$ : chr [1:2] "x" "y"
..- attr(*, "scaled:center")= Named num [1:2] 7014 1352
.. ..- attr(*, "names")= chr [1:2] "x" "y"
$ 4: num [1:7, 1:2] -6262 -6317 -2393 17931 7431 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:7] "1" "2" "19" "23" ...
.. ..$ : chr [1:2] "x" "y"
..- attr(*, "scaled:center")= Named num [1:2] 22069 43
.. ..- attr(*, "names")= chr [1:2] "x" "y"
Now, create a single new object containing the centered sub-groups from
DF.ctr. Since scale() returns matrices, rbind()-ing them will give a matrix
rather than a data frame:
DF.new <- do.call(rbind, DF.ctr)
Define colnames:
colnames(DF.new) <- c("x-mean", "y-mean")
> str(DF.new)
num [1:44, 1:2] 138.5 58.2 38.2 -94.2 -131.1 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:44] "8" "9" "10" "14" ...
..$ : chr [1:2] "x-mean" "y-mean"
Now, use merge() to join DF and DF.new by the row names. Note that merge()
sorts the result by the (character) row names, so the rows will not be in
the original order:
DF.final <- merge(DF, DF.new, by = "row.names")
> DF.final
Row.names x y index x-mean y-mean
1 1 15807.24 12.50 4 -6262.12857 -30.498571
2 10 195.40 204.00 1 38.22750 122.350000
3 11 4240.21 22.40 2 1726.93188 -208.037500
4 12 1222.72 45.90 2 -1290.55812 -184.537500
5 13 1142.26 23.60 2 -1371.01812 -206.837500
6 14 63.00 90.10 1 -94.17250 8.450000
7 15 1216.00 82.40 2 -1297.27812 -148.037500
8 16 2769.60 111.00 2 256.32188 -119.437500
9 17 1790.46 34.70 2 -722.81812 -195.737500
10 18 26.10 26.10 1 -131.07250 -55.550000
11 19 19676.83 0.99 4 -2392.53857 -42.008571
12 2 15752.51 33.50 4 -6316.85857 -9.498571
13 20 10920.60 203.00 3 3906.26923 -1148.809231
14 21 6144.00 46.00 3 -870.33077 -1305.809231
15 22 4534.48 4534.48 3 -2479.85077 3182.670769
16 23 40000.00 65.00 4 17930.63143 22.001429
17 24 29500.00 56.00 4 7430.63143 13.001429
18 25 17100.00 77.00 4 -4969.36857 34.001429
19 26 9000.00 435.00 3 1985.66923 -916.809231
20 27 6300.00 84.00 3 -714.33077 -1267.809231
21 28 3962.88 334.00 2 1449.60188 103.562500
22 29 5690.00 653.00 3 -1324.33077 -698.809231
23 3 12893.76 1.50 3 5879.42923 -1350.309231
24 30 3736.00 233.00 2 1222.72188 2.562500
25 31 2750.00 22.00 2 236.72188 -208.437500
26 32 1316.00 345.00 2 -1197.27812 114.562500
27 33 4595.00 4595.00 3 -2419.33077 3243.190769
28 34 5928.00 45.00 3 -1086.33077 -1306.809231
29 35 2645.70 0.00 2 132.42188 -230.437500
30 36 2580.24 454.00 2 66.96187 223.562500
31 37 6547.34 6547.34 3 -466.99077 5195.530769
32 38 1615.68 5.00 2 -897.59812 -225.437500
33 39 194.06 55.00 1 36.88750 -26.650000
34 4 8426.88 22.20 3 1412.54923 -1329.609231
35 40 184.80 6.00 1 27.62750 -75.650000
36 41 82.94 44.00 1 -74.23250 -37.650000
37 42 16649.00 56.00 4 -5420.36857 13.001429
38 43 4500.00 74.00 3 -2514.33077 -1277.809231
39 44 1600.00 744.00 2 -913.27812 513.562500
40 5 5706.24 333.00 3 -1308.09077 -1018.809231
41 6 3982.08 560.00 2 1468.80188 329.562500
42 7 3642.62 670.00 2 1129.34188 439.562500
43 8 295.68 124.00 1 138.50750 42.350000
44 9 215.40 104.00 1 58.22750 22.350000
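If you want the rows of DF.final back in the original order, one way would
be to sort on the 'Row.names' column that merge() creates, converting it to
numeric first:

DF.final <- DF.final[order(as.numeric(as.character(DF.final$Row.names))), ]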
With respect to getting the correlation coefficient for each sub-group,
you can do the following:
> unlist(lapply(split(DF[, -3], DF$index), function(x) cor(x)[1, 2]))
1 2 3 4
0.4468744 0.2619220 -0.3608070 0.3848641
See ?split, ?lapply, ?scale, ?do.call, ?rbind, ?unlist, ?merge and ?cor
HTH,
Marc Schwartz