[R] Joining two datasets - recursive procedure?

Bert Gunter gunter.berton at gene.com
Sat Mar 21 18:13:19 CET 2015


... or cleaner:

z1 <- with(f1,v4 + z -ave(z,v1,v2,FUN=mean))


Just for curiosity, was this homework? (in which case I should
probably have not provided you an answer -- that is, assuming that I
HAVE provided an answer).

Cheers,
Bert

Bert Gunter
Genentech Nonclinical Biostatistics
(650) 467-7374

"Data is not information. Information is not knowledge. And knowledge
is certainly not wisdom."
Clifford Stoll




On Sat, Mar 21, 2015 at 7:53 AM, Bert Gunter <bgunter at gene.com> wrote:
> z <- rnorm(nrow(f1)) ## or anything you want
> z1 <- f1$v4 + z - with(f1,ave(z,v1,v2,FUN=mean))
>
>
> aggregate(v4~v1,f1,sum)
> aggregate(z1~v1,f1,sum)
> aggregate(v4~v2,f1,sum)
> aggregate(z1~v2,f1,sum)
> aggregate(v4~v3,f1,sum)
> aggregate(z1~v3,f1,sum)
>
>
> Cheers,
> Bert
>
> Bert Gunter
> Genentech Nonclinical Biostatistics
> (650) 467-7374
>
> "Data is not information. Information is not knowledge. And knowledge
> is certainly not wisdom."
> Clifford Stoll
>
>
>
>
> On Sat, Mar 21, 2015 at 6:49 AM, Luca Meyer <lucam1968 at gmail.com> wrote:
>> Hi Bert,
>>
>> Thank you for your message. I am looking into ave() and tapply() as you
>> suggested but at the same time I have prepared a example of input and output
>> files, just in case you or someone else would like to make an attempt to
>> generate a code that goes from input to output.
>>
>> Please see below or download it from
>> https://www.dropbox.com/s/qhmpkkrejjkpbkx/sample_code.txt?dl=0
>>
>> # this is (an extract of) the INPUT file I have:
>> f1 <- structure(list(v1 = c("A", "A", "A", "A", "A", "A", "B", "B",
>> "B", "B", "B", "B"), v2 = c("A", "B", "C", "A", "B", "C", "A",
>> "B", "C", "A", "B", "C"), v3 = c("B", "B", "B", "C", "C", "C",
>> "B", "B", "B", "C", "C", "C"), v4 = c(18.18530, 3.43806,0.00273, 1.42917,
>> 1.05786, 0.00042, 2.37232, 3.01835, 0, 1.13430, 0.92872,
>> 0)), .Names = c("v1", "v2", "v3", "v4"), class = "data.frame", row.names =
>> c(2L,
>> 9L, 11L, 41L, 48L, 50L, 158L, 165L, 167L, 197L, 204L, 206L))
>>
>> # this is (an extract of) the OUTPUT file I would like to obtain:
>> f2 <- structure(list(v1 = c("A", "A", "A", "A", "A", "A", "B", "B",
>> "B", "B", "B", "B"), v2 = c("A", "B", "C", "A", "B", "C", "A",
>> "B", "C", "A", "B", "C"), v3 = c("B", "B", "B", "C", "C", "C",
>> "B", "B", "B", "C", "C", "C"), v4 = c(17.83529, 3.43806,0.00295, 1.77918,
>> 1.05786, 0.0002, 2.37232, 3.01835, 0, 1.13430, 0.92872,
>> 0)), .Names = c("v1", "v2", "v3", "v4"), class = "data.frame", row.names =
>> c(2L,
>> 9L, 11L, 41L, 48L, 50L, 158L, 165L, 167L, 197L, 204L, 206L))
>>
>> # please notice that while the aggregated v4 on v3 has changed …
>> aggregate(f1[,c("v4")],list(f1$v3),sum)
>> aggregate(f2[,c("v4")],list(f2$v3),sum)
>>
>> # … the aggregated v4 over v1xv2 has remained unchanged:
>> aggregate(f1[,c("v4")],list(f1$v1,f1$v2),sum)
>> aggregate(f2[,c("v4")],list(f2$v1,f2$v2),sum)
>>
>> Thank you very much in advance for your assitance.
>>
>> Luca
>>
>> 2015-03-21 13:18 GMT+01:00 Bert Gunter <gunter.berton at gene.com>:
>>>
>>> 1. Still not sure what you mean, but maybe look at ?ave and ?tapply,
>>> for which ave() is a wrapper.
>>>
>>> 2. You still need to heed the rest of Jeff's advice.
>>>
>>> Cheers,
>>> Bert
>>>
>>> Bert Gunter
>>> Genentech Nonclinical Biostatistics
>>> (650) 467-7374
>>>
>>> "Data is not information. Information is not knowledge. And knowledge
>>> is certainly not wisdom."
>>> Clifford Stoll
>>>
>>>
>>>
>>>
>>> On Sat, Mar 21, 2015 at 4:53 AM, Luca Meyer <lucam1968 at gmail.com> wrote:
>>> > Hi Jeff & other R-experts,
>>> >
>>> > Thank you for your note. I have tried myself to solve the issue without
>>> > success.
>>> >
>>> > Following your suggestion, I am providing a sample of the dataset I am
>>> > using below (also downloadble in plain text from
>>> > https://www.dropbox.com/s/qhmpkkrejjkpbkx/sample_code.txt?dl=0):
>>> >
>>> > #this is an extract of the overall dataset (n=1200 cases)
>>> > f1 <- structure(list(v1 = c("A", "A", "A", "A", "A", "A", "B", "B",
>>> > "B", "B", "B", "B"), v2 = c("A", "B", "C", "A", "B", "C", "A",
>>> > "B", "C", "A", "B", "C"), v3 = c("B", "B", "B", "C", "C", "C",
>>> > "B", "B", "B", "C", "C", "C"), v4 = c(18.1853007621835,
>>> > 3.43806581506388,
>>> > 0.002733567617055, 1.42917483425029, 1.05786640463504,
>>> > 0.000420548864162308,
>>> > 2.37232740842861, 3.01835841813241, 0, 1.13430282139936,
>>> > 0.928725667117666,
>>> > 0)), .Names = c("v1", "v2", "v3", "v4"), class = "data.frame", row.names
>>> > =
>>> > c(2L,
>>> > 9L, 11L, 41L, 48L, 50L, 158L, 165L, 167L, 197L, 204L, 206L))
>>> >
>>> > I need to find a automated procedure that allows me to adjust v3
>>> > marginals
>>> > while maintaining v1xv2 marginals unchanged.
>>> >
>>> > That is: modify the v4 values you can find by running:
>>> >
>>> > aggregate(f1[,c("v4")],list(f1$v3),sum)
>>> >
>>> > while maintaining costant the values you can find by running:
>>> >
>>> > aggregate(f1[,c("v4")],list(f1$v1,f1$v2),sum)
>>> >
>>> > Now does it make sense?
>>> >
>>> > Please notice I have tried to build some syntax that tries to modify
>>> > values
>>> > within each v1xv2 combination by computing sum of v4, row percentage in
>>> > terms of v4, and there is where my effort is blocked. Not really sure
>>> > how I
>>> > should proceed. Any suggestion?
>>> >
>>> > Thanks,
>>> >
>>> > Luca
>>> >
>>> >
>>> > 2015-03-19 2:38 GMT+01:00 Jeff Newmiller <jdnewmil at dcn.davis.ca.us>:
>>> >
>>> >> I don't understand your description. The standard practice on this list
>>> >> is
>>> >> to provide a reproducible R example [1] of the kind of data you are
>>> >> working
>>> >> with (and any code you have tried) to go along with your description.
>>> >> In
>>> >> this case, that would be two dputs of your input data frames and a dput
>>> >> of
>>> >> an output data frame (generated by hand from your input data frame).
>>> >> (Probably best to not use the full number of input values just to keep
>>> >> the
>>> >> size down.) We could then make an attempt to generate code that goes
>>> >> from
>>> >> input to output.
>>> >>
>>> >> Of course, if you post that hard work using HTML then it will get
>>> >> corrupted (much like the text below from your earlier emails) and we
>>> >> won't
>>> >> be able to use it. Please learn to post from your email software using
>>> >> plain text when corresponding with this mailing list.
>>> >>
>>> >> [1]
>>> >>
>>> >> http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
>>> >>
>>> >> ---------------------------------------------------------------------------
>>> >> Jeff Newmiller                        The     .....       .....  Go
>>> >> Live...
>>> >> DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live
>>> >> Go...
>>> >>                                       Live:   OO#.. Dead: OO#..
>>> >> Playing
>>> >> Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
>>> >> /Software/Embedded Controllers)               .OO#.       .OO#.
>>> >> rocks...1k
>>> >>
>>> >> ---------------------------------------------------------------------------
>>> >> Sent from my phone. Please excuse my brevity.
>>> >>
>>> >> On March 18, 2015 9:05:37 AM PDT, Luca Meyer <lucam1968 at gmail.com>
>>> >> wrote:
>>> >> >Thanks for you input Michael,
>>> >> >
>>> >> >The continuous variable I have measures quantities (down to the 3rd
>>> >> >decimal level) so unfortunately are not frequencies.
>>> >> >
>>> >> >Any more specific suggestions on how that could be tackled?
>>> >> >
>>> >> >Thanks & kind regards,
>>> >> >
>>> >> >Luca
>>> >> >
>>> >> >
>>> >> >===
>>> >> >
>>> >> >Michael Friendly wrote:
>>> >> >I'm not sure I understand completely what you want to do, but
>>> >> >if the data were frequencies, it sounds like task for fitting a
>>> >> >loglinear model with the model formula
>>> >> >
>>> >> >~ V1*V2 + V3
>>> >> >
>>> >> >On 3/18/2015 2:17 AM, Luca Meyer wrote:
>>> >> >>* Hello,
>>> >> >*>>* I am facing a quite challenging task (at least to me) and I was
>>> >> >wondering
>>> >> >*>* if someone could advise how R could assist me to speed the task
>>> >> > up.
>>> >> >*>>* I am dealing with a dataset with 3 discrete variables and one
>>> >> >continuous
>>> >> >*>* variable. The discrete variables are:
>>> >> >*>>* V1: 8 modalities
>>> >> >*>* V2: 13 modalities
>>> >> >*>* V3: 13 modalities
>>> >> >*>>* The continuous variable V4 is a decimal number always greater
>>> >> > than
>>> >> >zero in
>>> >> >*>* the marginals of each of the 3 variables but it is sometimes equal
>>> >> >to zero
>>> >> >*>* (and sometimes negative) in the joint tables.
>>> >> >*>>* I have got 2 files:
>>> >> >*>>* => one with distribution of all possible combinations of V1xV2
>>> >> >(some of
>>> >> >*>* which are zero or neagtive) and
>>> >> >*>* => one with the marginal distribution of V3.
>>> >> >*>>* I am trying to build the long and narrow dataset V1xV2xV3 in such
>>> >> >a way
>>> >> >*>* that each V1xV2 cell does not get modified and V3 fits as closely
>>> >> >as
>>> >> >*>* possible to its marginal distribution. Does it make sense?
>>> >> >*>>* To be even more specific, my 2 input files look like the
>>> >> >following.
>>> >> >*>>* FILE 1
>>> >> >*>* V1,V2,V4
>>> >> >*>* A, A, 24.251
>>> >> >*>* A, B, 1.065
>>> >> >*>* (...)
>>> >> >*>* B, C, 0.294
>>> >> >*>* B, D, 2.731
>>> >> >*>* (...)
>>> >> >*>* H, L, 0.345
>>> >> >*>* H, M, 0.000
>>> >> >*>>* FILE 2
>>> >> >*>* V3, V4
>>> >> >*>* A, 1.575
>>> >> >*>* B, 4.294
>>> >> >*>* C, 10.044
>>> >> >*>* (...)
>>> >> >*>* L, 5.123
>>> >> >*>* M, 3.334
>>> >> >*>>* What I need to achieve is a file such as the following
>>> >> >*>>* FILE 3
>>> >> >*>* V1, V2, V3, V4
>>> >> >*>* A, A, A, ???
>>> >> >*>* A, A, B, ???
>>> >> >*>* (...)
>>> >> >*>* D, D, E, ???
>>> >> >*>* D, D, F, ???
>>> >> >*>* (...)
>>> >> >*>* H, M, L, ???
>>> >> >*>* H, M, M, ???
>>> >> >*>>* Please notice that FILE 3 need to be such that if I aggregate on
>>> >> >V1+V2 I
>>> >> >*>* recover exactly FILE 1 and that if I aggregate on V3 I can recover
>>> >> >a file
>>> >> >*>* as close as possible to FILE 3 (ideally the same file).
>>> >> >*>>* Can anyone suggest how I could do that with R?
>>> >> >*>>* Thank you very much indeed for any assistance you are able to
>>> >> >provide.
>>> >> >*>>* Kind regards,
>>> >> >*>>* Luca*
>>> >> >
>>> >> >       [[alternative HTML version deleted]]
>>> >> >
>>> >> >______________________________________________
>>> >> >R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> >> >https://stat.ethz.ch/mailman/listinfo/r-help
>>> >> >PLEASE do read the posting guide
>>> >> >http://www.R-project.org/posting-guide.html
>>> >> >and provide commented, minimal, self-contained, reproducible code.
>>> >>
>>> >>
>>> >
>>> >         [[alternative HTML version deleted]]
>>> >
>>> > ______________________________________________
>>> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> > https://stat.ethz.ch/mailman/listinfo/r-help
>>> > PLEASE do read the posting guide
>>> > http://www.R-project.org/posting-guide.html
>>> > and provide commented, minimal, self-contained, reproducible code.
>>
>>



More information about the R-help mailing list