[R] Complicated analysis for huge databases

Allaisone 1 allaisone1 at hotmail.com
Fri Nov 17 18:59:33 CET 2017


Hi all,


I have a large dataset of around 600,000 rows and 600 columns. The first column holds codes for Meal A and the second column holds codes for Meal B. The third column holds customer IDs, where each customer had a combination of meals. Each of the remaining columns contains the values 0, 1, or 2. The dataset is organised so that the first group of customers had similar meal combinations; this is followed by another group of customers with similar meal combinations that differ from those of the first group, and so on. The dataset looks like this:


> MyData

     Meal A   Meal B   Cust.ID   I   II   III   IV   ......   600
1    33       55       1         0   1    2     0
2    33       55       3         1   0    2     2
3    33       55       5         2   1    1     2
4    44       66       7         0   2    2     2
5    44       66       4         1   1    0     1
6    44       66       9         2   0    1     2
.
.
600,000



I want to compute maf() for each column (from 4 to 600) after calculating the frequency of the three values (0, 1, 2), but this should be done group by group (i.e. group (33-55): rows 1:3, then group (44-66): rows 4:6, and so on).


I can do the analysis for an entire column, but not group by group, like this:


MAF <- apply(MyData[,4:600], 2, function(x)maf(tabulate(x+1)))
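
One detail of the line above may matter once the data are split into small groups: tabulate(x + 1) sizes its result by the largest value present, so a group in which 2 never occurs would yield two counts instead of three. A minimal sketch on made-up values (not taken from the real data) shows the difference when nbins = 3 is passed; whether maf() needs all three counts is an assumption, since maf() itself is not shown here.

```r
# Toy column in which the value 2 never occurs (made-up values)
x <- c(0, 1, 1, 0)

# Without nbins, the result has only as many bins as the largest value seen
counts_default <- tabulate(x + 1)

# With nbins = 3, there is always one count each for 0, 1 and 2
counts_fixed <- tabulate(x + 1, nbins = 3)

print(counts_default)
print(counts_fixed)
```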

How can I modify this code to tell R to do the analysis group by group for each column, so that I get the maf value for the 33-55 group of column I, then the maf value for the 44-66 group in the same column I, then the remaining groups in this column, and likewise for the remaining columns?
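
One possible approach, given only as a sketch: build a group label from the two meal columns and use tapply() inside the per-column loop. The function maf_sketch() below is a hypothetical stand-in, because the real maf() is not shown in this post; it assumes the counts arrive as c(n0, n1, n2) and returns the smaller of the two frequencies implied by 0/1/2 coding. The column names Meal.A and Meal.B are also assumptions about how the headers were read in.

```r
# Hypothetical stand-in for maf(); the original function is not shown.
# Assumes counts = c(n0, n1, n2) for the values 0, 1, 2.
maf_sketch <- function(counts) {
  p <- (counts[2] + 2 * counts[3]) / (2 * sum(counts))
  min(p, 1 - p)
}

# Toy data mirroring the six example rows shown earlier
MyData <- data.frame(
  Meal.A  = c(33, 33, 33, 44, 44, 44),
  Meal.B  = c(55, 55, 55, 66, 66, 66),
  Cust.ID = c(1, 3, 5, 7, 4, 9),
  I       = c(0, 1, 2, 0, 1, 2),
  II      = c(1, 0, 1, 2, 1, 0),
  III     = c(2, 2, 1, 2, 0, 1),
  IV      = c(0, 2, 2, 2, 1, 2)
)

# One group label per row, e.g. "33-55"
grp <- paste(MyData$Meal.A, MyData$Meal.B, sep = "-")

# For each value column, split by group and summarise each group's counts
value_cols <- 4:ncol(MyData)
MAF_by_group <- sapply(MyData[, value_cols], function(x) {
  tapply(x, grp, function(v) maf_sketch(tabulate(v + 1, nbins = 3)))
})

print(MAF_by_group)
```

The result is a matrix with one row per meal combination and one column per value column, so MAF_by_group["33-55", "I"] would be the value for the 33-55 group in column I. On the full 600,000 x 600 data this may be slow, and a different tool might be needed.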

In fact, I'm interested in doing this analysis for only 300 columns, not all of the 600 columns.
I have another sheet that contains the names of the columns of interest, like this:

>ColOfinterest

Col
I
IV
V
.
.
300
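
Restricting the work to the listed columns could look roughly like the sketch below. It assumes ColOfinterest is a data frame with a single character column named Col, as the printout suggests, and uses made-up toy values; intersect() is used so that a listed name missing from MyData is dropped silently rather than causing an error, which may or may not be the desired behaviour.

```r
# Toy stand-ins for the two objects described above (made-up values)
MyData <- data.frame(
  Meal.A  = c(33, 33, 44, 44),
  Meal.B  = c(55, 55, 66, 66),
  Cust.ID = c(1, 3, 7, 4),
  I       = c(0, 1, 0, 1),
  II      = c(1, 0, 2, 1),
  III     = c(2, 2, 2, 0),
  IV      = c(0, 2, 2, 1)
)
ColOfinterest <- data.frame(Col = c("I", "IV", "V"), stringsAsFactors = FALSE)

# Keep only the requested names that actually exist in MyData
keep <- intersect(ColOfinterest$Col, names(MyData))

# Per-column counts of 0, 1, 2, restricted to the columns of interest
counts <- sapply(MyData[, keep, drop = FALSE],
                 function(x) tabulate(x + 1, nbins = 3))

print(keep)
print(counts)
```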

Could anyone help with the best combination of syntax to perform this complex analysis?

Regards
Allaisone






