[R] Complicated analysis for huge databases

Sat Nov 18 01:54:07 CET 2017

Or do it in one go using tapply() and friends (see ?tapply).
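A minimal sketch of what the one-pass approach might look like, under stated assumptions. maf() is never defined in this thread, so maf_stub() below is a hypothetical stand-in that treats 0/1/2 as allele counts; swap in the real maf(). The toy data copies the six example rows from the original post, the column names MealA and MealB are simplified, and cols is a hypothetical stand-in for the ColOfinterest sheet.

```r
# Toy data: the six example rows from the original post.
MyData <- data.frame(
  MealA   = c(33, 33, 33, 44, 44, 44),
  MealB   = c(55, 55, 55, 66, 66, 66),
  Cust.ID = c(1, 3, 5, 7, 4, 9),
  I       = c(0, 1, 2, 0, 1, 2),
  II      = c(1, 0, 1, 2, 1, 0),
  III     = c(2, 2, 1, 2, 0, 1),
  IV      = c(0, 2, 2, 2, 1, 2)
)

# ASSUMPTION: hypothetical stand-in for maf(), which the thread does not define.
# Takes the counts of 0s, 1s and 2s and returns the smaller allele frequency.
maf_stub <- function(counts) {
  p <- (counts[2] + 2 * counts[3]) / (2 * sum(counts))
  min(p, 1 - p)
}

# One grouping factor per meal combination, e.g. "33.55" and "44.66".
grp <- interaction(MyData$MealA, MyData$MealB, drop = TRUE)

# ASSUMPTION: stand-in for the column names listed in ColOfinterest.
cols <- c("I", "IV")

# For each column of interest, apply the statistic within each group.
# nbins = 3 keeps the count vector at length 3 even if a value is absent.
MAF <- sapply(MyData[cols], function(x) {
  tapply(x, grp, function(v) maf_stub(tabulate(v + 1, nbins = 3)))
})

print(MAF)  # rows are groups, columns are the selected columns
```

Fixing nbins at 3 matters here: without it, a group that contains no 2s would yield a shorter count vector than the other groups.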

Bert

On Nov 17, 2017 1:12 PM, "Boris Steipe" <boris.steipe at utoronto.ca> wrote:

> Combine columns 1 and 2 into a column with a single ID like "33.55",
> "44.66" and use split() on these IDs to break up your dataset. Iterate over
> the list of data frames split() returns.
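The split() suggestion above could be sketched as follows. This is an illustration under assumptions, not code posted in the thread: maf_stub() is a hypothetical stand-in for the undefined maf(), the toy data copies the six example rows from the original post with simplified column names, and cols stands in for the ColOfinterest sheet.

```r
# Toy data: the six example rows from the original post.
MyData <- data.frame(
  MealA   = c(33, 33, 33, 44, 44, 44),
  MealB   = c(55, 55, 55, 66, 66, 66),
  Cust.ID = c(1, 3, 5, 7, 4, 9),
  I       = c(0, 1, 2, 0, 1, 2),
  II      = c(1, 0, 1, 2, 1, 0),
  III     = c(2, 2, 1, 2, 0, 1),
  IV      = c(0, 2, 2, 2, 1, 2)
)

# ASSUMPTION: hypothetical stand-in for maf(), which the thread does not define.
maf_stub <- function(counts) {
  p <- (counts[2] + 2 * counts[3]) / (2 * sum(counts))
  min(p, 1 - p)
}

# ASSUMPTION: stand-in for the column names listed in ColOfinterest.
cols <- c("I", "IV")

# Combine columns 1 and 2 into a single ID such as "33.55".
group_id <- paste(MyData$MealA, MyData$MealB, sep = ".")

# Break the dataset into a list of data frames, one per ID.
pieces <- split(MyData[cols], group_id)

# Iterate over the pieces, computing the statistic for every selected column.
per_group <- lapply(pieces, function(d) {
  sapply(d, function(x) maf_stub(tabulate(x + 1, nbins = 3)))
})

# Stack the per-group results: one row per group, one column per variable.
MAF_by_group <- do.call(rbind, per_group)
print(MAF_by_group)
```

Splitting only the columns of interest, rather than the whole data frame, avoids copying the unused columns, which may matter at 600,000 rows.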
>
>
> B.
>
> > On Nov 17, 2017, at 12:59 PM, Allaisone 1 <allaisone1 at hotmail.com> wrote:
> >
> >
> > Hi all ..,
> >
> >
> > I have a large dataset of around 600,000 rows and 600 columns. The first
> > column holds codes for Meal A and the second column holds codes for Meal B.
> > The third column holds customer IDs, where each customer had one
> > combination of meals. Each of the remaining columns contains the values
> > 0, 1, or 2. The dataset is organised so that the first group of customers
> > shares the same meal combination, followed by another group of customers
> > sharing a different combination, and so on. The dataset looks like this:
> >
> >
> >> MyData
> >
> >     Meal A  Meal B  Cust.ID  I  II  III  IV  ...... 600
> > 1   33      55      1        0  1   2    0
> > 2   33      55      3        1  0   2    2
> > 3   33      55      5        2  1   1    2
> > 4   44      66      7        0  2   2    2
> > 5   44      66      4        1  1   0    1
> > 6   44      66      9        2  0   1    2
> > .
> > .
> > 600,000
> >
> >
> >
> > I want to compute maf() for each column (from 4 to 600) after calculating
> > the frequency of the three values (0, 1, 2), but this should be done group
> > by group (i.e. group 33-55: rows 1:3, then group 44-66: rows 4:6, and so on).
> >
> >
> > I can do the analysis for each entire column, like this, but not group by
> > group:
> >
> >
> > MAF <- apply(MyData[,4:600], 2, function(x)maf(tabulate(x+1)))
> >
> > How can I modify this code to tell R to do the analysis group by group
> > for each column, so that I get the maf value for the 33-55 group of column
> > I, then the maf value for the 44-66 group in the same column I, then the
> > rest of the groups in this column, and then the same for the remaining
> > columns?
> >
> > In fact, I'm interested in doing this analysis for only 300 columns, not
> > all of the 600 columns.
> > I have another sheet that contains the names of the columns of interest,
> > like this:
> >
> >> ColOfinterest
> >
> > Col
> > I
> > IV
> > V
> > .
> > .
> > 300
> >
> > Could anyone help with the best combination of syntax to perform this
> > complex analysis?
> >
> > Regards
> > Allaisone
> >
> > ______________________________________________
> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
>