[R] Joining two datasets - recursive procedure?

Luca Meyer lucam1968 at gmail.com
Sun Mar 22 16:21:13 CET 2015


Hi Bert,

Thanks again for your assistance.

Unfortunately when I apply the additional code you suggest I get B=40.23326
& C=-8.66603 and not  B=29 & C=2.56723. Any idea why that might be
happening?

Please see below or on
https://www.dropbox.com/s/qhmpkkrejjkpbkx/sample_code.txt?dl=0 the code I
am running:

rm(list=ls())

# this is (an extract of) the INPUT file I have:
f1 <- structure(list(v1 = c("A", "A", "A", "A", "A", "A", "B", "B",
"B", "B", "B", "B"), v2 = c("A", "B", "C", "A", "B", "C", "A",
"B", "C", "A", "B", "C"), v3 = c("B", "B", "B", "C", "C", "C",
"B", "B", "B", "C", "C", "C"), v4 = c(18.18530, 3.43806,0.00273, 1.42917,
1.05786, 0.00042, 2.37232, 3.01835, 0, 1.13430, 0.92872,
0)), .Names = c("v1", "v2", "v3", "v4"), class = "data.frame", row.names =
c(2L,
9L, 11L, 41L, 48L, 50L, 158L, 165L, 167L, 197L, 204L, 206L))

# this is the procedure that Bert suggested (slightly adjusted):

y <- f1$v3  ## to simplify the notation; could be done using with()
z <- (c(29,2.567)/table(y))[c(y)]
# z <- rnorm(nrow(f1)) ## or anything you want
z1 <- round(with(f1,v4 + z -ave(z,v1,v2,FUN=mean)), digits=5)
aggregate(v4~v1*v2,f1,sum)
aggregate(z1~v1*v2,f1,sum)
aggregate(v4~v3,f1,sum)
aggregate(z1~v3,f1,sum)

Thanks again,

Luca


2015-03-22 15:55 GMT+01:00 Bert Gunter <gunter.berton a gene.com>:

> I would have thought that this is straightforward given my previous
> email...
>
> Just set z to what you want -- e,g, all B values to 29/number of B's,
> and all C values to 2.567/number of C's (etc. for more categories).
>
> A slick but sort of cheat way to do this programmatically -- in the
> sense that it relies on the implementation of factor() rather than its
> API -- is:
>
> y <- f1$v3  ## to simplify the notation; could be done using with()
> z <- (c(29,2.567)/table(y))[c(y)]
>
> Then proceed to z1 as I previously described
>
> -- Bert
>
>
> Bert Gunter
> Genentech Nonclinical Biostatistics
> (650) 467-7374
>
> "Data is not information. Information is not knowledge. And knowledge
> is certainly not wisdom."
> Clifford Stoll
>
>
>
>
> On Sun, Mar 22, 2015 at 2:00 AM, Luca Meyer <lucam1968 a gmail.com> wrote:
> > Hi Bert, hello R-experts,
> >
> > I am close to a solution but I still need one hint w.r.t. the following
> > procedure (available also from
> > https://www.dropbox.com/s/qhmpkkrejjkpbkx/sample_code.txt?dl=0)
> >
> > rm(list=ls())
> >
> > # this is (an extract of) the INPUT file I have:
> > f1 <- structure(list(v1 = c("A", "A", "A", "A", "A", "A", "B", "B", "B",
> > "B", "B", "B"), v2 = c("A", "B", "C", "A", "B", "C", "A", "B", "C", "A",
> > "B", "C"), v3 = c("B", "B", "B", "C", "C", "C", "B", "B", "B", "C", "C",
> > "C"), v4 = c(18.18530, 3.43806,0.00273, 1.42917, 1.05786, 0.00042,
> 2.37232,
> > 3.01835, 0, 1.13430, 0.92872, 0)), .Names = c("v1", "v2", "v3", "v4"),
> class
> > = "data.frame", row.names = c(2L, 9L, 11L, 41L, 48L, 50L, 158L, 165L,
> 167L,
> > 197L, 204L, 206L))
> >
> > # this is the procedure that Bert suggested (slightly adjusted):
> > z <- rnorm(nrow(f1)) ## or anything you want
> > z1 <- round(with(f1,v4 + z -ave(z,v1,v2,FUN=mean)), digits=5)
> > aggregate(v4~v1*v2,f1,sum)
> > aggregate(z1~v1*v2,f1,sum)
> > aggregate(v4~v3,f1,sum)
> > aggregate(z1~v3,f1,sum)
> >
> > My question to you is: how can I set z so that I can obtain specific
> values
> > for z1-v4 in the v3 aggregation?
> > In other words, how can I configure the procedure so that e.g. B=29 and
> > C=2.56723 after running the procedure:
> > aggregate(z1~v3,f1,sum)
> >
> > Thank you,
> >
> > Luca
> >
> > PS: to avoid any doubts you might have about who I am the following is my
> > web page: http://lucameyer.wordpress.com/
> >
> >
> > 2015-03-21 18:13 GMT+01:00 Bert Gunter <gunter.berton a gene.com>:
> >>
> >> ... or cleaner:
> >>
> >> z1 <- with(f1,v4 + z -ave(z,v1,v2,FUN=mean))
> >>
> >>
> >> Just for curiosity, was this homework? (in which case I should
> >> probably have not provided you an answer -- that is, assuming that I
> >> HAVE provided an answer).
> >>
> >> Cheers,
> >> Bert
> >>
> >> Bert Gunter
> >> Genentech Nonclinical Biostatistics
> >> (650) 467-7374
> >>
> >> "Data is not information. Information is not knowledge. And knowledge
> >> is certainly not wisdom."
> >> Clifford Stoll
> >>
> >>
> >>
> >>
> >> On Sat, Mar 21, 2015 at 7:53 AM, Bert Gunter <bgunter a gene.com> wrote:
> >> > z <- rnorm(nrow(f1)) ## or anything you want
> >> > z1 <- f1$v4 + z - with(f1,ave(z,v1,v2,FUN=mean))
> >> >
> >> >
> >> > aggregate(v4~v1,f1,sum)
> >> > aggregate(z1~v1,f1,sum)
> >> > aggregate(v4~v2,f1,sum)
> >> > aggregate(z1~v2,f1,sum)
> >> > aggregate(v4~v3,f1,sum)
> >> > aggregate(z1~v3,f1,sum)
> >> >
> >> >
> >> > Cheers,
> >> > Bert
> >> >
> >> > Bert Gunter
> >> > Genentech Nonclinical Biostatistics
> >> > (650) 467-7374
> >> >
> >> > "Data is not information. Information is not knowledge. And knowledge
> >> > is certainly not wisdom."
> >> > Clifford Stoll
> >> >
> >> >
> >> >
> >> >
> >> > On Sat, Mar 21, 2015 at 6:49 AM, Luca Meyer <lucam1968 a gmail.com>
> wrote:
> >> >> Hi Bert,
> >> >>
> >> >> Thank you for your message. I am looking into ave() and tapply() as
> you
> >> >> suggested but at the same time I have prepared a example of input and
> >> >> output
> >> >> files, just in case you or someone else would like to make an attempt
> >> >> to
> >> >> generate a code that goes from input to output.
> >> >>
> >> >> Please see below or download it from
> >> >> https://www.dropbox.com/s/qhmpkkrejjkpbkx/sample_code.txt?dl=0
> >> >>
> >> >> # this is (an extract of) the INPUT file I have:
> >> >> f1 <- structure(list(v1 = c("A", "A", "A", "A", "A", "A", "B", "B",
> >> >> "B", "B", "B", "B"), v2 = c("A", "B", "C", "A", "B", "C", "A",
> >> >> "B", "C", "A", "B", "C"), v3 = c("B", "B", "B", "C", "C", "C",
> >> >> "B", "B", "B", "C", "C", "C"), v4 = c(18.18530, 3.43806,0.00273,
> >> >> 1.42917,
> >> >> 1.05786, 0.00042, 2.37232, 3.01835, 0, 1.13430, 0.92872,
> >> >> 0)), .Names = c("v1", "v2", "v3", "v4"), class = "data.frame",
> >> >> row.names =
> >> >> c(2L,
> >> >> 9L, 11L, 41L, 48L, 50L, 158L, 165L, 167L, 197L, 204L, 206L))
> >> >>
> >> >> # this is (an extract of) the OUTPUT file I would like to obtain:
> >> >> f2 <- structure(list(v1 = c("A", "A", "A", "A", "A", "A", "B", "B",
> >> >> "B", "B", "B", "B"), v2 = c("A", "B", "C", "A", "B", "C", "A",
> >> >> "B", "C", "A", "B", "C"), v3 = c("B", "B", "B", "C", "C", "C",
> >> >> "B", "B", "B", "C", "C", "C"), v4 = c(17.83529, 3.43806,0.00295,
> >> >> 1.77918,
> >> >> 1.05786, 0.0002, 2.37232, 3.01835, 0, 1.13430, 0.92872,
> >> >> 0)), .Names = c("v1", "v2", "v3", "v4"), class = "data.frame",
> >> >> row.names =
> >> >> c(2L,
> >> >> 9L, 11L, 41L, 48L, 50L, 158L, 165L, 167L, 197L, 204L, 206L))
> >> >>
> >> >> # please notice that while the aggregated v4 on v3 has changed …
> >> >> aggregate(f1[,c("v4")],list(f1$v3),sum)
> >> >> aggregate(f2[,c("v4")],list(f2$v3),sum)
> >> >>
> >> >> # … the aggregated v4 over v1xv2 has remained unchanged:
> >> >> aggregate(f1[,c("v4")],list(f1$v1,f1$v2),sum)
> >> >> aggregate(f2[,c("v4")],list(f2$v1,f2$v2),sum)
> >> >>
> >> >> Thank you very much in advance for your assitance.
> >> >>
> >> >> Luca
> >> >>
> >> >> 2015-03-21 13:18 GMT+01:00 Bert Gunter <gunter.berton a gene.com>:
> >> >>>
> >> >>> 1. Still not sure what you mean, but maybe look at ?ave and ?tapply,
> >> >>> for which ave() is a wrapper.
> >> >>>
> >> >>> 2. You still need to heed the rest of Jeff's advice.
> >> >>>
> >> >>> Cheers,
> >> >>> Bert
> >> >>>
> >> >>> Bert Gunter
> >> >>> Genentech Nonclinical Biostatistics
> >> >>> (650) 467-7374
> >> >>>
> >> >>> "Data is not information. Information is not knowledge. And
> knowledge
> >> >>> is certainly not wisdom."
> >> >>> Clifford Stoll
> >> >>>
> >> >>>
> >> >>>
> >> >>>
> >> >>> On Sat, Mar 21, 2015 at 4:53 AM, Luca Meyer <lucam1968 a gmail.com>
> >> >>> wrote:
> >> >>> > Hi Jeff & other R-experts,
> >> >>> >
> >> >>> > Thank you for your note. I have tried myself to solve the issue
> >> >>> > without
> >> >>> > success.
> >> >>> >
> >> >>> > Following your suggestion, I am providing a sample of the dataset
> I
> >> >>> > am
> >> >>> > using below (also downloadble in plain text from
> >> >>> > https://www.dropbox.com/s/qhmpkkrejjkpbkx/sample_code.txt?dl=0):
> >> >>> >
> >> >>> > #this is an extract of the overall dataset (n=1200 cases)
> >> >>> > f1 <- structure(list(v1 = c("A", "A", "A", "A", "A", "A", "B",
> "B",
> >> >>> > "B", "B", "B", "B"), v2 = c("A", "B", "C", "A", "B", "C", "A",
> >> >>> > "B", "C", "A", "B", "C"), v3 = c("B", "B", "B", "C", "C", "C",
> >> >>> > "B", "B", "B", "C", "C", "C"), v4 = c(18.1853007621835,
> >> >>> > 3.43806581506388,
> >> >>> > 0.002733567617055, 1.42917483425029, 1.05786640463504,
> >> >>> > 0.000420548864162308,
> >> >>> > 2.37232740842861, 3.01835841813241, 0, 1.13430282139936,
> >> >>> > 0.928725667117666,
> >> >>> > 0)), .Names = c("v1", "v2", "v3", "v4"), class = "data.frame",
> >> >>> > row.names
> >> >>> > =
> >> >>> > c(2L,
> >> >>> > 9L, 11L, 41L, 48L, 50L, 158L, 165L, 167L, 197L, 204L, 206L))
> >> >>> >
> >> >>> > I need to find a automated procedure that allows me to adjust v3
> >> >>> > marginals
> >> >>> > while maintaining v1xv2 marginals unchanged.
> >> >>> >
> >> >>> > That is: modify the v4 values you can find by running:
> >> >>> >
> >> >>> > aggregate(f1[,c("v4")],list(f1$v3),sum)
> >> >>> >
> >> >>> > while maintaining costant the values you can find by running:
> >> >>> >
> >> >>> > aggregate(f1[,c("v4")],list(f1$v1,f1$v2),sum)
> >> >>> >
> >> >>> > Now does it make sense?
> >> >>> >
> >> >>> > Please notice I have tried to build some syntax that tries to
> modify
> >> >>> > values
> >> >>> > within each v1xv2 combination by computing sum of v4, row
> percentage
> >> >>> > in
> >> >>> > terms of v4, and there is where my effort is blocked. Not really
> >> >>> > sure
> >> >>> > how I
> >> >>> > should proceed. Any suggestion?
> >> >>> >
> >> >>> > Thanks,
> >> >>> >
> >> >>> > Luca
> >> >>> >
> >> >>> >
> >> >>> > 2015-03-19 2:38 GMT+01:00 Jeff Newmiller <
> jdnewmil a dcn.davis.ca.us>:
> >> >>> >
> >> >>> >> I don't understand your description. The standard practice on
> this
> >> >>> >> list
> >> >>> >> is
> >> >>> >> to provide a reproducible R example [1] of the kind of data you
> are
> >> >>> >> working
> >> >>> >> with (and any code you have tried) to go along with your
> >> >>> >> description.
> >> >>> >> In
> >> >>> >> this case, that would be two dputs of your input data frames and
> a
> >> >>> >> dput
> >> >>> >> of
> >> >>> >> an output data frame (generated by hand from your input data
> >> >>> >> frame).
> >> >>> >> (Probably best to not use the full number of input values just to
> >> >>> >> keep
> >> >>> >> the
> >> >>> >> size down.) We could then make an attempt to generate code that
> >> >>> >> goes
> >> >>> >> from
> >> >>> >> input to output.
> >> >>> >>
> >> >>> >> Of course, if you post that hard work using HTML then it will get
> >> >>> >> corrupted (much like the text below from your earlier emails) and
> >> >>> >> we
> >> >>> >> won't
> >> >>> >> be able to use it. Please learn to post from your email software
> >> >>> >> using
> >> >>> >> plain text when corresponding with this mailing list.
> >> >>> >>
> >> >>> >> [1]
> >> >>> >>
> >> >>> >>
> >> >>> >>
> http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
> >> >>> >>
> >> >>> >>
> >> >>> >>
> ---------------------------------------------------------------------------
> >> >>> >> Jeff Newmiller                        The     .....       .....
> Go
> >> >>> >> Live...
> >> >>> >> DCN:<jdnewmil a dcn.davis.ca.us>        Basics: ##.#.       ##.#.
> >> >>> >> Live
> >> >>> >> Go...
> >> >>> >>                                       Live:   OO#.. Dead: OO#..
> >> >>> >> Playing
> >> >>> >> Research Engineer (Solar/Batteries            O.O#.       #.O#.
> >> >>> >> with
> >> >>> >> /Software/Embedded Controllers)               .OO#.       .OO#.
> >> >>> >> rocks...1k
> >> >>> >>
> >> >>> >>
> >> >>> >>
> ---------------------------------------------------------------------------
> >> >>> >> Sent from my phone. Please excuse my brevity.
> >> >>> >>
> >> >>> >> On March 18, 2015 9:05:37 AM PDT, Luca Meyer <
> lucam1968 a gmail.com>
> >> >>> >> wrote:
> >> >>> >> >Thanks for you input Michael,
> >> >>> >> >
> >> >>> >> >The continuous variable I have measures quantities (down to the
> >> >>> >> > 3rd
> >> >>> >> >decimal level) so unfortunately are not frequencies.
> >> >>> >> >
> >> >>> >> >Any more specific suggestions on how that could be tackled?
> >> >>> >> >
> >> >>> >> >Thanks & kind regards,
> >> >>> >> >
> >> >>> >> >Luca
> >> >>> >> >
> >> >>> >> >
> >> >>> >> >===
> >> >>> >> >
> >> >>> >> >Michael Friendly wrote:
> >> >>> >> >I'm not sure I understand completely what you want to do, but
> >> >>> >> >if the data were frequencies, it sounds like task for fitting a
> >> >>> >> >loglinear model with the model formula
> >> >>> >> >
> >> >>> >> >~ V1*V2 + V3
> >> >>> >> >
> >> >>> >> >On 3/18/2015 2:17 AM, Luca Meyer wrote:
> >> >>> >> >>* Hello,
> >> >>> >> >*>>* I am facing a quite challenging task (at least to me) and I
> >> >>> >> > was
> >> >>> >> >wondering
> >> >>> >> >*>* if someone could advise how R could assist me to speed the
> >> >>> >> > task
> >> >>> >> > up.
> >> >>> >> >*>>* I am dealing with a dataset with 3 discrete variables and
> one
> >> >>> >> >continuous
> >> >>> >> >*>* variable. The discrete variables are:
> >> >>> >> >*>>* V1: 8 modalities
> >> >>> >> >*>* V2: 13 modalities
> >> >>> >> >*>* V3: 13 modalities
> >> >>> >> >*>>* The continuous variable V4 is a decimal number always
> greater
> >> >>> >> > than
> >> >>> >> >zero in
> >> >>> >> >*>* the marginals of each of the 3 variables but it is sometimes
> >> >>> >> > equal
> >> >>> >> >to zero
> >> >>> >> >*>* (and sometimes negative) in the joint tables.
> >> >>> >> >*>>* I have got 2 files:
> >> >>> >> >*>>* => one with distribution of all possible combinations of
> >> >>> >> > V1xV2
> >> >>> >> >(some of
> >> >>> >> >*>* which are zero or neagtive) and
> >> >>> >> >*>* => one with the marginal distribution of V3.
> >> >>> >> >*>>* I am trying to build the long and narrow dataset V1xV2xV3
> in
> >> >>> >> > such
> >> >>> >> >a way
> >> >>> >> >*>* that each V1xV2 cell does not get modified and V3 fits as
> >> >>> >> > closely
> >> >>> >> >as
> >> >>> >> >*>* possible to its marginal distribution. Does it make sense?
> >> >>> >> >*>>* To be even more specific, my 2 input files look like the
> >> >>> >> >following.
> >> >>> >> >*>>* FILE 1
> >> >>> >> >*>* V1,V2,V4
> >> >>> >> >*>* A, A, 24.251
> >> >>> >> >*>* A, B, 1.065
> >> >>> >> >*>* (...)
> >> >>> >> >*>* B, C, 0.294
> >> >>> >> >*>* B, D, 2.731
> >> >>> >> >*>* (...)
> >> >>> >> >*>* H, L, 0.345
> >> >>> >> >*>* H, M, 0.000
> >> >>> >> >*>>* FILE 2
> >> >>> >> >*>* V3, V4
> >> >>> >> >*>* A, 1.575
> >> >>> >> >*>* B, 4.294
> >> >>> >> >*>* C, 10.044
> >> >>> >> >*>* (...)
> >> >>> >> >*>* L, 5.123
> >> >>> >> >*>* M, 3.334
> >> >>> >> >*>>* What I need to achieve is a file such as the following
> >> >>> >> >*>>* FILE 3
> >> >>> >> >*>* V1, V2, V3, V4
> >> >>> >> >*>* A, A, A, ???
> >> >>> >> >*>* A, A, B, ???
> >> >>> >> >*>* (...)
> >> >>> >> >*>* D, D, E, ???
> >> >>> >> >*>* D, D, F, ???
> >> >>> >> >*>* (...)
> >> >>> >> >*>* H, M, L, ???
> >> >>> >> >*>* H, M, M, ???
> >> >>> >> >*>>* Please notice that FILE 3 need to be such that if I
> aggregate
> >> >>> >> > on
> >> >>> >> >V1+V2 I
> >> >>> >> >*>* recover exactly FILE 1 and that if I aggregate on V3 I can
> >> >>> >> > recover
> >> >>> >> >a file
> >> >>> >> >*>* as close as possible to FILE 3 (ideally the same file).
> >> >>> >> >*>>* Can anyone suggest how I could do that with R?
> >> >>> >> >*>>* Thank you very much indeed for any assistance you are able
> to
> >> >>> >> >provide.
> >> >>> >> >*>>* Kind regards,
> >> >>> >> >*>>* Luca*
> >> >>> >> >
> >> >>> >> >       [[alternative HTML version deleted]]
> >> >>> >> >
> >> >>> >> >______________________________________________
> >> >>> >> >R-help a r-project.org mailing list -- To UNSUBSCRIBE and more,
> see
> >> >>> >> >https://stat.ethz.ch/mailman/listinfo/r-help
> >> >>> >> >PLEASE do read the posting guide
> >> >>> >> >http://www.R-project.org/posting-guide.html
> >> >>> >> >and provide commented, minimal, self-contained, reproducible
> code.
> >> >>> >>
> >> >>> >>
> >> >>> >
> >> >>> >         [[alternative HTML version deleted]]
> >> >>> >
> >> >>> > ______________________________________________
> >> >>> > R-help a r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >> >>> > https://stat.ethz.ch/mailman/listinfo/r-help
> >> >>> > PLEASE do read the posting guide
> >> >>> > http://www.R-project.org/posting-guide.html
> >> >>> > and provide commented, minimal, self-contained, reproducible code.
> >> >>
> >> >>
> >
> >
>

	[[alternative HTML version deleted]]



More information about the R-help mailing list