[R] Antwort: Re: Re: Spread data.frame on 2 variables (SOLVED)

Tue Aug 2 15:13:09 CEST 2016

Hi Ulrik,

many thanks for your help.

The problem was that R regards a dataset with a combination like

caseID          custID          channel         unit
1               1000            10              10
2               1000            20              10
3               1000            20              30

as two diffrenet sets of cases: 1 set = case 1, 2 set = case 2 and 3 due 
to the different values of unit in case 3 value 30, althought all cases 
should be restructured based just on custID.

To get a dataset like

caseID          custID          channel -10     channel-20      unit-10 
unit-30
1               1000            1               1               1 1

instead of

caseID          custID          channel -10     channel-20      unit-10 
unit-30
1               1000            1               1               1 NA
2               1000            NA              1               NA 1

I used the approach you suggested:

1. I created a subset of my data with the first variable to be 
restructured:

d_temp1 <- dataset[ , c("custID", "channel"))

2. I deleted all the cases the were dupliates

d_temp1 <- duplicated(d_temp1, c("custID", "channel")

3. I introduced a dummy variable delivering the values for the new 
variables created by dplyr:spread()

d_temp1$dummy <- 1 

4. Then I restructured the subset
d_temp1 <- dplyr::spread(d_temp1, key_variable = "channel", value = 
d_temp1$dummy)

5. I repeaed steps 1 to 4 with the other variable "unit" (instead of 
"channel") creating a new dataset named d_temp2.

6. I deleted the variables used for restructuring in steps 1 to 5 
"channel" and "unit" from the original dataset "dataset".

dataset$channel <- NULL
dataset$unit <- NULL

7. I checked if I still had duplicates

duplicates <- duplicated(dataset, key_variable = c("Debitor"))

sum(duplicates)  # was 0 it this time

8. I merged the datasets back together

dataset_2 <- merge(x = dataset, y = d_temp1, by.x = "Debitor", by.y = 
"Debitor", all.x = TRUE, all.y = TRUE)  # leaving out all.y would be fine
dataset_2 <- merge(x = dataset2, y = d_temp2, by.x = "Debitor", by.y = 
"Debitor", all.x = TRUE, all.y = TRUE)  # leaving out all.y would be fine

There might be a combination of commands and functions doing the same 
thing in one step but I find that this is clear, comprehensible and 
reproducable even at a later date or by other readers willing to use base 
R for their work.

Many thanks again for your help.

Kind regards

Georg

Von:    Ulrik Stervbo <ulrik.stervbo at gmail.com>
An:     G.Maubach at weinwolf.de, R-help <r-help at r-project.org>, 
Datum:  28.07.2016 14:20
Betreff:        Re: Re: [R] Spread data.frame on 2 variables

Hi Georg,

it is difficult to figure out what happens between your expectation and 
the outcome if we cannot see a minimal dataset.

Based on your description I did this

library(tidyr)
library(dplyr)

test_df <- data_frame(channel = LETTERS[1:5], unit = letters[1:5], custID 
= c(1:5), dummy = 1)
test_df %>% spread(channel, dummy) %>% mutate(dummy = 1) %>% spread(unit, 
dummy) 

which seems to be working fine as I get wide data. If a combination is 
missing in the long form it will also be missing in the wide form. Maybe 
you are looking for something like this:

channel_wide <- test_df  %>% select(channel, custID) %>% spread(channel, 
custID) 
unit_wide <- test_df  %>% select(unit, custID) %>% spread(unit, custID) 
bind_cols(channel_wide, unit_wide)

Apologies for the HTML - it's gmail

Best wishes,
Ulrik

On Thu, 28 Jul 2016 at 13:54 <G.Maubach at weinwolf.de> wrote:
Hi Ulrik,

I have included a reproducable example. I ran the code and it did exactly
what I wanted to show you.

You are right: the solution shall merge cases in the end cause the values
on the variables are either missing or the same.

Example 1: Values are the same
If you look at 6 and 7 and variable 70 the value is 1 in both cases. This
is in this context the same information and cases 6 and 7 with custID can
be merged to 1 for variable 70.

Example 2: Values are missing and not missing
If you look at cases 8 and 9 the value for case 8 at variable 40, 50 and
2000 is missing whereas the variables 40, 50 and 2000 have all 1 for case
9. Case 8 and 9 could be merged together cause the missing values are
overwritten what is correct in this case.

The solution I am looking for is to transform the data from long into wide
form and keep all but missing value information.

Did I explain my problem in a comprehensible way? Are there any further
questions?

Kind regards

Georg

Von:    Ulrik Stervbo <ulrik.stervbo at gmail.com>
An:     G.Maubach at weinwolf.de, r-help at r-project.org,
Datum:  28.07.2016 12:59
Betreff:        Re: [R] Spread data.frame on 2 variables

Hi Georg,

it's hard to tell without a reproducible example.

Should spread really merge elements? Does spread know anything about
CustID? Maybe you need to make a useful key of the CustIDs first and
spread on that?

Maybe I'm all off, because I'm really just guessing.

Best,
Ulrik

On Thu, 28 Jul 2016 at 12:36 <G.Maubach at weinwolf.de> wrote:
Hi All,

I need to spread a data.frame on 2 variables, e. g. "channel" and "unit".

If I do it in two steps spreads keeps all cases that does not look like
the one before although it contains the same values for a specific case.

Here is what I have right now:

-- cut --

test1$dummy <- 1
test2 <- spread(data = test1, key = 'channel', value = "dummy")
test2
cat("First spread is OK!")

test2$dummy <- 1
test3 <- spread(data = test2, key = 'unit', value = 'dummy')

test1
# test2
test3
warning(paste0("Second spread is not OK cause spread does not merge
cases\n",
               "with CustID 700 and 800 into one case,\n",
               "cause they have values on different variables,\n",
               "although the corresponding values of the cases with",
               "custID 700 and 800 are missing."))

cat("What I would like to have is:\n")
target4 <- structure(list(custID = c(100, 200, 300, 500, 600, 700, 800,
900),
  `10` = c(1, NA, NA, NA, NA, NA, NA, NA),
  `20` = c(1, NA, NA, NA, NA, NA, NA, NA),
  `30` = c(NA, NA, NA, NA, NA, NA, 1, 1),
  `40` = c(NA, NA, NA, NA, 1, NA, 1, 1),
  `50` = c(NA, NA, 1, NA, NA, NA, 1, 1),
  `60` = c(NA, NA, NA, NA, NA, 1, NA, NA),
  `70` = c(NA, NA, NA, NA, NA, 1, NA, NA),
  `99` = c(NA, 1, NA, 1, NA, NA, NA, NA),
  `1000` = c(1, NA, NA, NA, NA, NA, 1, 1),
  `2000` = c(NA, NA, NA, NA, 1, 1, 1, NA),
  `3000` = c(NA, NA, 1, NA, NA, 1, NA, NA),
  `4000` = c(NA, NA, 1, NA, NA, NA, NA, NA),
  `6000` = c(NA, NA, NA, NA, 1, NA, NA, NA),
  `9999` = c(NA, 1, NA, 1, NA, NA, NA, NA)),
.Names = c("custID",
 "10",  "20",  "30",  "40",  "50",  "60",  "70",  "99",
 "1000",  "2000",  "3000",  "4000",  "6000",  "9999"),
row.names = c(NA, 8L), class = "data.frame")

target4

cat("What would be a proper way to create target4 from test1?")

-- cut --

What would be the proper way to create target4 from test1?

Kind regards

Georg

______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.