[R] Joining two datasets - recursive procedure?

Sat Mar 21 13:18:29 CET 2015

1. Still not sure what you mean, but maybe look at ?ave and ?tapply,
for which ave() is a wrapper.

2. You still need to heed the rest of Jeff's advice.
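
For what it's worth, here is a minimal sketch of how tapply() and ave() differ, on made-up data shaped like the f1 extract quoted below. The values and the names d, cell_sum, cell_total and share are purely illustrative assumptions, not anything from your real data:

```r
# Toy data in the same shape as f1 (values are invented for illustration)
d <- data.frame(
  v1 = c("A", "A", "A", "A", "B", "B", "B", "B"),
  v2 = c("A", "B", "A", "B", "A", "B", "A", "B"),
  v3 = c("B", "B", "C", "C", "B", "B", "C", "C"),
  v4 = c(18, 3, 2, 1, 4, 6, 1, 4),
  stringsAsFactors = FALSE
)

# tapply() returns one value per group: here a v1 x v2 matrix of v4 sums
cell_sum <- tapply(d$v4, list(d$v1, d$v2), sum)

# ave() applies FUN within groups and returns a vector as long as the input,
# so each row of d gets the total of its own v1 x v2 cell
d$cell_total <- ave(d$v4, d$v1, d$v2, FUN = sum)

# Share of each row within its v1 x v2 cell (the "row percentage")
d$share <- d$v4 / d$cell_total
```

Note that a v1 x v2 cell whose total is zero would give NaN shares here, so such cells would probably need separate handling.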

Cheers,
Bert

Bert Gunter
Genentech Nonclinical Biostatistics
(650) 467-7374

"Data is not information. Information is not knowledge. And knowledge
is certainly not wisdom."
Clifford Stoll

On Sat, Mar 21, 2015 at 4:53 AM, Luca Meyer <lucam1968 at gmail.com> wrote:
> Hi Jeff & other R-experts,
>
> Thank you for your note. I have tried myself to solve the issue without
> success.
>
> Following your suggestion, I am providing a sample of the dataset I am
> using below (also downloadable in plain text from
> https://www.dropbox.com/s/qhmpkkrejjkpbkx/sample_code.txt?dl=0):
>
> # This is an extract of the overall dataset (n = 1200 cases)
> f1 <- structure(list(
>   v1 = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"),
>   v2 = c("A", "B", "C", "A", "B", "C", "A", "B", "C", "A", "B", "C"),
>   v3 = c("B", "B", "B", "C", "C", "C", "B", "B", "B", "C", "C", "C"),
>   v4 = c(18.1853007621835, 3.43806581506388, 0.002733567617055,
>          1.42917483425029, 1.05786640463504, 0.000420548864162308,
>          2.37232740842861, 3.01835841813241, 0,
>          1.13430282139936, 0.928725667117666, 0)),
>   .Names = c("v1", "v2", "v3", "v4"), class = "data.frame",
>   row.names = c(2L, 9L, 11L, 41L, 48L, 50L, 158L, 165L, 167L, 197L, 204L, 206L))
>
> I need to find an automated procedure that allows me to adjust v3 marginals
> while maintaining v1xv2 marginals unchanged.
>
> That is: modify the v4 values you can find by running:
>
> aggregate(f1[,c("v4")],list(f1$v3),sum)
>
> while keeping constant the values you can find by running:
>
> aggregate(f1[,c("v4")],list(f1$v1,f1$v2),sum)
>
> Now does it make sense?
>
> Please notice I have tried to build some syntax that modifies values
> within each v1xv2 combination by computing the sum of v4 and the row
> percentage in terms of v4, and that is where I am stuck. I am not really
> sure how I should proceed. Any suggestions?
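
One way to read this request is as iterative proportional fitting: alternately rescale the v3 columns to their new totals and the v1xv2 rows back to their fixed totals. The following is only a sketch under assumptions: the table m, the target v3 marginals in col_target and all values are invented, every value is assumed non-negative with non-zero row totals, and the targets are assumed to sum to the same grand total as the data.

```r
# Toy table: rows are v1 x v2 cells, columns are v3 levels (invented values)
m <- matrix(c(18, 3, 4, 6,
               2, 1, 1, 4),
            nrow = 4,
            dimnames = list(cell = c("A.A", "A.B", "B.A", "B.B"),
                            v3   = c("B", "C")))

row_target <- rowSums(m)          # v1 x v2 totals that must stay fixed
col_target <- c(B = 27, C = 12)   # hypothetical new v3 marginals

# The two sets of targets must agree on the grand total
stopifnot(isTRUE(all.equal(sum(row_target), sum(col_target))))

fit <- m
for (i in 1:1000) {
  # scale each column so the v3 totals hit the target
  fit <- sweep(fit, 2, col_target / colSums(fit), "*")
  # scale each row so the v1 x v2 totals are restored exactly
  fit <- sweep(fit, 1, row_target / rowSums(fit), "*")
  # stop once the v3 totals are also (numerically) on target
  if (max(abs(colSums(fit) - col_target)) < 1e-8) break
}

rowSums(fit)   # v1 x v2 totals, unchanged
colSums(fit)   # v3 totals, close to col_target
```

Because the loop ends on the row step, the v1xv2 totals are reproduced exactly and the v3 totals only approximately. Cells with a zero total would divide by zero, and negative values break the multiplicative scaling, so this would not carry over unchanged to data that contain them.
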
>
> Thanks,
>
> Luca
>
>
> 2015-03-19 2:38 GMT+01:00 Jeff Newmiller <jdnewmil at dcn.davis.ca.us>:
>
>> I don't understand your description. The standard practice on this list is
>> to provide a reproducible R example [1] of the kind of data you are working
>> with (and any code you have tried) to go along with your description. In
>> this case, that would be two dputs of your input data frames and a dput of
>> an output data frame (generated by hand from your input data frame).
>> (Probably best to not use the full number of input values just to keep the
>> size down.) We could then make an attempt to generate code that goes from
>> input to output.
>>
>> Of course, if you post that hard work using HTML then it will get
>> corrupted (much like the text below from your earlier emails) and we won't
>> be able to use it. Please learn to post from your email software using
>> plain text when corresponding with this mailing list.
>>
>> [1]
>> http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
>> ---------------------------------------------------------------------------
>> Jeff Newmiller                        The     .....       .....  Go Live...
>> DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
>>                                       Live:   OO#.. Dead: OO#..  Playing
>> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
>> /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
>> ---------------------------------------------------------------------------
>> Sent from my phone. Please excuse my brevity.
>>
>> On March 18, 2015 9:05:37 AM PDT, Luca Meyer <lucam1968 at gmail.com> wrote:
>> >Thanks for your input Michael,
>> >
>> >The continuous variable I have measures quantities (down to the 3rd
>> >decimal level), so unfortunately they are not frequencies.
>> >
>> >Any more specific suggestions on how that could be tackled?
>> >
>> >Thanks & kind regards,
>> >
>> >Luca
>> >
>> >
>> >===
>> >
>> >Michael Friendly wrote:
>> >I'm not sure I understand completely what you want to do, but
>> >if the data were frequencies, it sounds like a task for fitting a
>> >loglinear model with the model formula
>> >
>> >~ V1*V2 + V3
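
For the frequency case only, a model of this form could be fitted with loglin() from the stats package, roughly as sketched below. The table tab and its counts are invented for illustration; margin = list(c(1, 2), 3) is the loglin() way of writing the formula above.

```r
# Toy 2 x 2 x 2 table of counts (invented values); dimensions are V1, V2, V3
tab <- array(c(18, 4, 3, 6, 2, 1, 1, 4),
             dim = c(2, 2, 2),
             dimnames = list(V1 = c("A", "B"),
                             V2 = c("A", "B"),
                             V3 = c("B", "C")))

# Fit ~ V1*V2 + V3: the V1 x V2 table and the V3 marginal are the
# margins the fitted table is made to reproduce
res <- loglin(tab, margin = list(c(1, 2), 3), fit = TRUE, print = FALSE)
fit_tab <- res$fit

apply(fit_tab, c(1, 2), sum)   # matches apply(tab, c(1, 2), sum)
apply(fit_tab, 3, sum)         # matches apply(tab, 3, sum)
```

The fitted table keeps both sets of marginals and treats V3 as independent of the V1 x V2 combination, which may or may not be what is wanted here.
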
>> >
>> >On 3/18/2015 2:17 AM, Luca Meyer wrote:
>> >> Hello,
>> >>
>> >> I am facing a quite challenging task (at least to me) and I was wondering
>> >> if someone could advise how R could assist me to speed the task up.
>> >>
>> >> I am dealing with a dataset with 3 discrete variables and one continuous
>> >> variable. The discrete variables are:
>> >>
>> >> V1: 8 modalities
>> >> V2: 13 modalities
>> >> V3: 13 modalities
>> >>
>> >> The continuous variable V4 is a decimal number always greater than zero in
>> >> the marginals of each of the 3 variables but it is sometimes equal to zero
>> >> (and sometimes negative) in the joint tables.
>> >>
>> >> I have got 2 files:
>> >>
>> >> => one with the distribution of all possible combinations of V1xV2 (some
>> >> of which are zero or negative) and
>> >> => one with the marginal distribution of V3.
>> >>
>> >> I am trying to build the long and narrow dataset V1xV2xV3 in such a way
>> >> that each V1xV2 cell does not get modified and V3 fits as closely as
>> >> possible to its marginal distribution. Does it make sense?
>> >>
>> >> To be even more specific, my 2 input files look like the following.
>> >>
>> >> FILE 1
>> >> V1,V2,V4
>> >> A, A, 24.251
>> >> A, B, 1.065
>> >> (...)
>> >> B, C, 0.294
>> >> B, D, 2.731
>> >> (...)
>> >> H, L, 0.345
>> >> H, M, 0.000
>> >>
>> >> FILE 2
>> >> V3, V4
>> >> A, 1.575
>> >> B, 4.294
>> >> C, 10.044
>> >> (...)
>> >> L, 5.123
>> >> M, 3.334
>> >>
>> >> What I need to achieve is a file such as the following
>> >>
>> >> FILE 3
>> >> V1, V2, V3, V4
>> >> A, A, A, ???
>> >> A, A, B, ???
>> >> (...)
>> >> D, D, E, ???
>> >> D, D, F, ???
>> >> (...)
>> >> H, M, L, ???
>> >> H, M, M, ???
>> >>
>> >> Please notice that FILE 3 needs to be such that if I aggregate on V1+V2 I
>> >> recover exactly FILE 1 and that if I aggregate on V3 I can recover a file
>> >> as close as possible to FILE 2 (ideally the same file).
>> >>
>> >> Can anyone suggest how I could do that with R?
>> >>
>> >> Thank you very much indeed for any assistance you are able to provide.
>> >>
>> >> Kind regards,
>> >>
>> >> Luca
>> >
>> >       [[alternative HTML version deleted]]
>> >
>> >______________________________________________
>> >R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> >https://stat.ethz.ch/mailman/listinfo/r-help
>> >PLEASE do read the posting guide
>> >http://www.R-project.org/posting-guide.html
>> >and provide commented, minimal, self-contained, reproducible code.