[R] Antwort: Re: Re: Spread data.frame on 2 variables (SOLVED)
G.Maubach at weinwolf.de
G.Maubach at weinwolf.de
Tue Aug 2 15:13:09 CEST 2016
Hi Ulrik,
many thanks for your help.
The problem was that R regards a dataset with a combination like
caseID custID channel unit
1 1000 10 10
2 1000 20 10
3 1000 20 30
as two diffrenet sets of cases: 1 set = case 1, 2 set = case 2 and 3 due
to the different values of unit in case 3 value 30, althought all cases
should be restructured based just on custID.
To get a dataset like
caseID custID channel -10 channel-20 unit-10
unit-30
1 1000 1 1 1 1
instead of
caseID custID channel -10 channel-20 unit-10
unit-30
1 1000 1 1 1 NA
2 1000 NA 1 NA 1
I used the approach you suggested:
1. I created a subset of my data with the first variable to be
restructured:
d_temp1 <- dataset[ , c("custID", "channel"))
2. I deleted all the cases the were dupliates
d_temp1 <- duplicated(d_temp1, c("custID", "channel")
3. I introduced a dummy variable delivering the values for the new
variables created by dplyr:spread()
d_temp1$dummy <- 1
4. Then I restructured the subset
d_temp1 <- dplyr::spread(d_temp1, key_variable = "channel", value =
d_temp1$dummy)
5. I repeaed steps 1 to 4 with the other variable "unit" (instead of
"channel") creating a new dataset named d_temp2.
6. I deleted the variables used for restructuring in steps 1 to 5
"channel" and "unit" from the original dataset "dataset".
dataset$channel <- NULL
dataset$unit <- NULL
7. I checked if I still had duplicates
duplicates <- duplicated(dataset, key_variable = c("Debitor"))
sum(duplicates) # was 0 it this time
8. I merged the datasets back together
dataset_2 <- merge(x = dataset, y = d_temp1, by.x = "Debitor", by.y =
"Debitor", all.x = TRUE, all.y = TRUE) # leaving out all.y would be fine
dataset_2 <- merge(x = dataset2, y = d_temp2, by.x = "Debitor", by.y =
"Debitor", all.x = TRUE, all.y = TRUE) # leaving out all.y would be fine
There might be a combination of commands and functions doing the same
thing in one step but I find that this is clear, comprehensible and
reproducable even at a later date or by other readers willing to use base
R for their work.
Many thanks again for your help.
Kind regards
Georg
Von: Ulrik Stervbo <ulrik.stervbo at gmail.com>
An: G.Maubach at weinwolf.de, R-help <r-help at r-project.org>,
Datum: 28.07.2016 14:20
Betreff: Re: Re: [R] Spread data.frame on 2 variables
Hi Georg,
it is difficult to figure out what happens between your expectation and
the outcome if we cannot see a minimal dataset.
Based on your description I did this
library(tidyr)
library(dplyr)
test_df <- data_frame(channel = LETTERS[1:5], unit = letters[1:5], custID
= c(1:5), dummy = 1)
test_df %>% spread(channel, dummy) %>% mutate(dummy = 1) %>% spread(unit,
dummy)
which seems to be working fine as I get wide data. If a combination is
missing in the long form it will also be missing in the wide form. Maybe
you are looking for something like this:
channel_wide <- test_df %>% select(channel, custID) %>% spread(channel,
custID)
unit_wide <- test_df %>% select(unit, custID) %>% spread(unit, custID)
bind_cols(channel_wide, unit_wide)
Apologies for the HTML - it's gmail
Best wishes,
Ulrik
On Thu, 28 Jul 2016 at 13:54 <G.Maubach at weinwolf.de> wrote:
Hi Ulrik,
I have included a reproducable example. I ran the code and it did exactly
what I wanted to show you.
You are right: the solution shall merge cases in the end cause the values
on the variables are either missing or the same.
Example 1: Values are the same
If you look at 6 and 7 and variable 70 the value is 1 in both cases. This
is in this context the same information and cases 6 and 7 with custID can
be merged to 1 for variable 70.
Example 2: Values are missing and not missing
If you look at cases 8 and 9 the value for case 8 at variable 40, 50 and
2000 is missing whereas the variables 40, 50 and 2000 have all 1 for case
9. Case 8 and 9 could be merged together cause the missing values are
overwritten what is correct in this case.
The solution I am looking for is to transform the data from long into wide
form and keep all but missing value information.
Did I explain my problem in a comprehensible way? Are there any further
questions?
Kind regards
Georg
Von: Ulrik Stervbo <ulrik.stervbo at gmail.com>
An: G.Maubach at weinwolf.de, r-help at r-project.org,
Datum: 28.07.2016 12:59
Betreff: Re: [R] Spread data.frame on 2 variables
Hi Georg,
it's hard to tell without a reproducible example.
Should spread really merge elements? Does spread know anything about
CustID? Maybe you need to make a useful key of the CustIDs first and
spread on that?
Maybe I'm all off, because I'm really just guessing.
Best,
Ulrik
On Thu, 28 Jul 2016 at 12:36 <G.Maubach at weinwolf.de> wrote:
Hi All,
I need to spread a data.frame on 2 variables, e. g. "channel" and "unit".
If I do it in two steps spreads keeps all cases that does not look like
the one before although it contains the same values for a specific case.
Here is what I have right now:
-- cut --
test1$dummy <- 1
test2 <- spread(data = test1, key = 'channel', value = "dummy")
test2
cat("First spread is OK!")
test2$dummy <- 1
test3 <- spread(data = test2, key = 'unit', value = 'dummy')
test1
# test2
test3
warning(paste0("Second spread is not OK cause spread does not merge
cases\n",
"with CustID 700 and 800 into one case,\n",
"cause they have values on different variables,\n",
"although the corresponding values of the cases with",
"custID 700 and 800 are missing."))
cat("What I would like to have is:\n")
target4 <- structure(list(custID = c(100, 200, 300, 500, 600, 700, 800,
900),
`10` = c(1, NA, NA, NA, NA, NA, NA, NA),
`20` = c(1, NA, NA, NA, NA, NA, NA, NA),
`30` = c(NA, NA, NA, NA, NA, NA, 1, 1),
`40` = c(NA, NA, NA, NA, 1, NA, 1, 1),
`50` = c(NA, NA, 1, NA, NA, NA, 1, 1),
`60` = c(NA, NA, NA, NA, NA, 1, NA, NA),
`70` = c(NA, NA, NA, NA, NA, 1, NA, NA),
`99` = c(NA, 1, NA, 1, NA, NA, NA, NA),
`1000` = c(1, NA, NA, NA, NA, NA, 1, 1),
`2000` = c(NA, NA, NA, NA, 1, 1, 1, NA),
`3000` = c(NA, NA, 1, NA, NA, 1, NA, NA),
`4000` = c(NA, NA, 1, NA, NA, NA, NA, NA),
`6000` = c(NA, NA, NA, NA, 1, NA, NA, NA),
`9999` = c(NA, 1, NA, 1, NA, NA, NA, NA)),
.Names = c("custID",
"10", "20", "30", "40", "50", "60", "70", "99",
"1000", "2000", "3000", "4000", "6000", "9999"),
row.names = c(NA, 8L), class = "data.frame")
target4
cat("What would be a proper way to create target4 from test1?")
-- cut --
What would be the proper way to create target4 from test1?
Kind regards
Georg
______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
More information about the R-help
mailing list