[R] dplyr - add/expand rows
Tóth Dénes
toth.denes at kogentum.hu
Wed Nov 29 23:47:01 CET 2017
Hi Martin,
On 11/29/2017 10:46 PM, Martin Morgan wrote:
> On 11/29/2017 04:15 PM, Tóth Dénes wrote:
>> Hi,
>>
>> A benchmarking study with an additional (data.table-based) solution.
>
> I don't think speed is the right benchmark (I do agree that correctness
> is!).
Well, agree, and sorry for the wording. It was really just an exercise
and not a full evaluation of the approaches. When I read the avalanche
of solutions neither of which mentioning data.table (my first choice for
data.frame-manipulations), I became curious how a one-liner data.table
code performs against the other solutions in terms of speed and
readability.
Second, I quite often have the feeling that dplyr is extremely overused
among novice (and sometimes even experienced) R users nowadays. This is
unfortunate, as the present example also illustrates.
Regards,
Denes
>
> For the R-help list, maybe something about least specialized R knowledge
> required would be appropriate? I'd say there were some 'hard' solutions
> -- Michael (deep understanding of Bioconductor and IRanges), Toth (deep
> understanding of data.table), Jim (at least for me moderate
> understanding of dplyr,especially the .$ notation; a simpler dplyr
> answer might have moved this response out of the 'difficult' category,
> especially given the familiarity of the OP with dplyr). I'd vote for
> Bill's as requiring the least specialized knowledge of R (though the +/-
> 1 indexing is an easy thing to get wrong).
>
> A different criteria might be reuse across analysis scenarios. Bill
> seems to win here again, since the principles are very general and at
> least moderately efficient (both Bert and Martin's solutions are
> essentially R-level iterations and have poor scalability, as
> demonstrated in the microbenchmarks; Bill's is mostly vectorized).
> Certainly data.table, dplyr, and IRanges are extremely useful within the
> confines of the problem domains they address.
>
> Martin
>
>> Enjoy! ;)
>>
>> Cheers,
>> Denes
>>
>>
>> --------------------------
>>
>>
>> ## packages ##########################
>>
>> library(dplyr)
>> library(data.table)
>> library(IRanges)
>> library(microbenchmark)
>>
>> ## prepare example dataset ###########
>>
>> ## use Bert's example, with 2000 stations instead of 2
>> d_df <- data.frame( station = rep(rep(c("one","two"),c(5,4)), 1000L),
>> from = as.integer(c(60,61,71,72,76,60,65,82,83)),
>> to = as.integer(c(60,70,71,76,83,64, 81, 82,83)),
>> record = c("A","B","C","B","D","B","B","D","E"),
>> stringsAsFactors = FALSE)
>> stations <- rle(d_df$station)
>> stations$value <- gsub(
>> " ", "0",
>> paste0("station", format(1:length(stations$value), width = 6)))
>> d_df$station <- rep(stations$value, stations$lengths)
>>
>> ## prepare tibble and data.table versions
>> d_tbl <- as_tibble(d_df)
>> d_dt <- as.data.table(d_df)
>>
>> ## solutions ##########################
>>
>> ## Bert - by
>> fun_bert <- function(d) {
>> out <- by(
>> d, d$station, function(x) with(x, {
>> i <- to - from +1
>> data.frame(record =rep(record,i),
>> year =sequence(i) -1 + rep(from,i),
>> stringsAsFactors = FALSE)
>> }))
>> data.frame(station = rep(names(out), sapply(out,nrow)),
>> do.call(rbind,out),
>> row.names = NULL,
>> stringsAsFactors = FALSE)
>> }
>>
>> ## Bill - transform
>> fun_bill <- function(d) {
>> i <- rep(seq_len(nrow(d)), d$to-d$from+1)
>> j <- sequence(d$to-d$from+1)
>> transform(d[i,], year=from+j-1, from=NULL, to=NULL)
>> }
>>
>> ## Michael - IRanges
>> fun_michael <- function(d) {
>> df <- with(d, DataFrame(station, record, year=IRanges(from, to)))
>> expand(df, "year")
>> }
>>
>> ## Jim - dplyr
>> fun_jim <- function(d) {
>> d %>%
>> rowwise() %>%
>> do(tibble(station = .$station,
>> record = .$record,
>> year = seq(.$from, .$to))
>> )
>> }
>>
>> ## Martin - Map
>> fun_martin <- function(d) {
>> d$year <- with(d, Map(seq, from, to))
>> res0 <- with(d, Map(data.frame,
>> station=station,
>> record=record,
>> year=year,
>> MoreArgs = list(stringsAsFactors = FALSE)))
>> do.call(rbind, unname(res0))
>> }
>>
>> ## Denes - simple data.table
>> fun_denes <- function(d) {
>> out <- d[, .(year = from:to), by = .(station, from, record)]
>> out[, from := NULL]
>> }
>>
>> ## Check equality ################################
>> all.equal(fun_bill(d_df), fun_bert(d_df),
>> check.attributes = FALSE)
>> all.equal(fun_bill(d_df), fun_martin(d_df),
>> check.attributes = FALSE)
>> all.equal(fun_bill(d_df), as.data.frame(fun_michael(d_df)),
>> check.attributes = FALSE)
>> all.equal(fun_bill(d_df), as.data.frame(fun_denes(d_dt)),
>> check.attributes = FALSE)
>> # Be prepared: this solution is super slow
>> all.equal(fun_bill(d_df), as.data.frame(fun_jim(d_tbl)),
>> check.attributes = FALSE)
>>
>> ## Benchmark #####################################
>>
>> ## Martin
>> print(system.time(fun_martin(d_df)))
>>
>> ## Bert
>> print(system.time(fun_bert(d_df)))
>>
>> ## Top 3
>> print(
>> microbenchmark(
>> fun_bill(d_df),
>> fun_michael(d_df),
>> fun_denes(d_dt),
>> times = 100L
>> )
>> )
>>
>>
>> -------------------------
>>
>> On 11/28/2017 06:49 PM, Michael Lawrence wrote:
>>> Or with the Bioconductor IRanges package:
>>>
>>> df <- with(input, DataFrame(station, year=IRanges(from, to), record))
>>> expand(df, "year")
>>>
>>> DataFrame with 24 rows and 3 columns
>>> station year record
>>> <character> <integer> <character>
>>> 1 07EA001 1960 QMS
>>> 2 07EA001 1961 QMC
>>> 3 07EA001 1962 QMC
>>> 4 07EA001 1963 QMC
>>> 5 07EA001 1964 QMC
>>> ... ... ... ...
>>> 20 07EA001 1979 QRC
>>> 21 07EA001 1980 QRC
>>> 22 07EA001 1981 QRC
>>> 23 07EA001 1982 QRC
>>> 24 07EA001 1983 QRC
>>>
>>> If you tell the computer more about your data, it can do more things for
>>> you.
>>>
>>> Michael
>>>
>>> On Tue, Nov 28, 2017 at 7:34 AM, Martin Morgan <
>>> martin.morgan at roswellpark.org> wrote:
>>>
>>>> On 11/26/2017 08:42 PM, jim holtman wrote:
>>>>
>>>>> try this:
>>>>>
>>>>> ##########################################
>>>>>
>>>>> library(dplyr)
>>>>>
>>>>> input <- tribble(
>>>>> ~station, ~from, ~to, ~record,
>>>>> "07EA001" , 1960 , 1960 , "QMS",
>>>>> "07EA001" , 1961 , 1970 , "QMC",
>>>>> "07EA001" , 1971 , 1971 , "QMM",
>>>>> "07EA001" , 1972 , 1976 , "QMC",
>>>>> "07EA001" , 1977 , 1983 , "QRC"
>>>>> )
>>>>>
>>>>> result <- input %>%
>>>>> rowwise() %>%
>>>>> do(tibble(station = .$station,
>>>>> year = seq(.$from, .$to),
>>>>> record = .$record)
>>>>> )
>>>>>
>>>>> ###########################
>>>>>
>>>>
>>>> In a bit more 'base R' mode I did
>>>>
>>>> input$year <- with(input, Map(seq, from, to))
>>>> res0 <- with(input, Map(data.frame, station=station, year=year,
>>>> record=record))
>>>> as_tibble(do.call(rbind, unname(res0)))# A tibble: 24 x 3
>>>>
>>>> resulting in
>>>>
>>>>> as_tibble(do.call(rbind, unname(res0)))# A tibble: 24 x 3
>>>> station year record
>>>> <fctr> <int> <fctr>
>>>> 1 07EA001 1960 QMS
>>>> 2 07EA001 1961 QMC
>>>> 3 07EA001 1962 QMC
>>>> 4 07EA001 1963 QMC
>>>> 5 07EA001 1964 QMC
>>>> 6 07EA001 1965 QMC
>>>> 7 07EA001 1966 QMC
>>>> 8 07EA001 1967 QMC
>>>> 9 07EA001 1968 QMC
>>>> 10 07EA001 1969 QMC
>>>> # ... with 14 more rows
>>>>
>>>> I though I should have been able to use `tibble` in the second step,
>>>> but
>>>> that leads to a (cryptic) error
>>>>
>>>>> res0 <- with(input, Map(tibble, station=station, year=year,
>>>> record=record))Error in captureDots(strict = `__quosured`) :
>>>> the argument has already been evaluated
>>>>
>>>> The 'station' and 'record' columns are factors, so different from the
>>>> original input, but this seems the appropriate data type for theses
>>>> columns.
>>>>
>>>> It's interesting to compare the 'specialized' knowledge needed for each
>>>> approach -- rowwise(), do(), .$ for tidyverse, with(), do.call(), maybe
>>>> rbind() and Map() for base R.
>>>>
>>>> Martin
>>>>
>>>>
>>>>
>>>>>
>>>>>
>>>>> Jim Holtman
>>>>> Data Munger Guru
>>>>>
>>>>> What is the problem that you are trying to solve?
>>>>> Tell me what you want to do, not how you want to do it.
>>>>>
>>>>> On Sun, Nov 26, 2017 at 2:10 PM, Bert Gunter <bgunter.4567 at gmail.com>
>>>>> wrote:
>>>>>
>>>>> To David W.'s point about lack of a suitable reprex ("reproducible
>>>>>> example"), Bill's solution seems to be for only one station.
>>>>>>
>>>>>> Here is a reprex and modification that I think does what was
>>>>>> requested
>>>>>> for
>>>>>> multiple stations, again using base R and data frames, not dplyr and
>>>>>> tibbles.
>>>>>>
>>>>>> First the reprex with **two** stations:
>>>>>>
>>>>>> d <- data.frame( station = rep(c("one","two"),c(5,4)),
>>>>>>>
>>>>>> from = c(60,61,71,72,76,60,65,82,83),
>>>>>> to = c(60,70,71,76,83,64, 81, 82,83),
>>>>>> record = c("A","B","C","B","D","B","B","D","E"))
>>>>>>
>>>>>> d
>>>>>>>
>>>>>> station from to record
>>>>>> 1 one 60 60 A
>>>>>> 2 one 61 70 B
>>>>>> 3 one 71 71 C
>>>>>> 4 one 72 76 B
>>>>>> 5 one 76 83 D
>>>>>> 6 two 60 64 B
>>>>>> 7 two 65 81 B
>>>>>> 8 two 82 82 D
>>>>>> 9 two 83 83 E
>>>>>>
>>>>>> ## Now the conversion code using base R, especially by():
>>>>>>
>>>>>> out <- by(d, d$station, function(x) with(x, {
>>>>>>>
>>>>>> + i <- to - from +1
>>>>>> + data.frame(YEAR =sequence(i) -1 +rep(from,i), RECORD
>>>>>> =rep(record,i))
>>>>>> + }))
>>>>>>
>>>>>>
>>>>>> out <- data.frame(station =
>>>>>>>
>>>>>> rep(names(out),sapply(out,nrow)),do.call(rbind,out), row.names =
>>>>>> NULL)
>>>>>>
>>>>>>
>>>>>> out
>>>>>>>
>>>>>> station YEAR RECORD
>>>>>> 1 one 60 A
>>>>>> 2 one 61 B
>>>>>> 3 one 62 B
>>>>>> 4 one 63 B
>>>>>> 5 one 64 B
>>>>>> 6 one 65 B
>>>>>> 7 one 66 B
>>>>>> 8 one 67 B
>>>>>> 9 one 68 B
>>>>>> 10 one 69 B
>>>>>> 11 one 70 B
>>>>>> 12 one 71 C
>>>>>> 13 one 72 B
>>>>>> 14 one 73 B
>>>>>> 15 one 74 B
>>>>>> 16 one 75 B
>>>>>> 17 one 76 B
>>>>>> 18 one 76 D
>>>>>> 19 one 77 D
>>>>>> 20 one 78 D
>>>>>> 21 one 79 D
>>>>>> 22 one 80 D
>>>>>> 23 one 81 D
>>>>>> 24 one 82 D
>>>>>> 25 one 83 D
>>>>>> 26 two 60 B
>>>>>> 27 two 61 B
>>>>>> 28 two 62 B
>>>>>> 29 two 63 B
>>>>>> 30 two 64 B
>>>>>> 31 two 65 B
>>>>>> 32 two 66 B
>>>>>> 33 two 67 B
>>>>>> 34 two 68 B
>>>>>> 35 two 69 B
>>>>>> 36 two 70 B
>>>>>> 37 two 71 B
>>>>>> 38 two 72 B
>>>>>> 39 two 73 B
>>>>>> 40 two 74 B
>>>>>> 41 two 75 B
>>>>>> 42 two 76 B
>>>>>> 43 two 77 B
>>>>>> 44 two 78 B
>>>>>> 45 two 79 B
>>>>>> 46 two 80 B
>>>>>> 47 two 81 B
>>>>>> 48 two 82 D
>>>>>> 49 two 83 E
>>>>>>
>>>>>> Cheers,
>>>>>> Bert
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Bert Gunter
>>>>>>
>>>>>> "The trouble with having an open mind is that people keep coming
>>>>>> along
>>>>>> and
>>>>>> sticking things into it."
>>>>>> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>>>>>>
>>>>>> On Sat, Nov 25, 2017 at 4:49 PM, William Dunlap via R-help <
>>>>>> r-help at r-project.org> wrote:
>>>>>>
>>>>>> dplyr may have something for this, but in base R I think the
>>>>>> following
>>>>>>>
>>>>>> does
>>>>>>
>>>>>>> what you want. I've shortened the name of your data set to 'd'.
>>>>>>>
>>>>>>> i <- rep(seq_len(nrow(d)), d$YEAR_TO-d$YEAR_FROM+1)
>>>>>>> j <- sequence(d$YEAR_TO-d$YEAR_FROM+1)
>>>>>>> transform(d[i,], YEAR=YEAR_FROM+j-1, YEAR_FROM=NULL, YEAR_TO=NULL)
>>>>>>>
>>>>>>>
>>>>>>> Bill Dunlap
>>>>>>> TIBCO Software
>>>>>>> wdunlap tibco.com
>>>>>>>
>>>>>>> On Sat, Nov 25, 2017 at 11:18 AM, Hutchinson, David (EC) <
>>>>>>> david.hutchinson at canada.ca> wrote:
>>>>>>>
>>>>>>> I have a returned tibble of station operational record similar to
>>>>>>> the
>>>>>>>> following:
>>>>>>>>
>>>>>>>> data.collection
>>>>>>>>>
>>>>>>>> # A tibble: 5 x 4
>>>>>>>> STATION_NUMBER YEAR_FROM YEAR_TO RECORD
>>>>>>>> <chr> <int> <int> <chr>
>>>>>>>> 1 07EA001 1960 1960 QMS
>>>>>>>> 2 07EA001 1961 1970 QMC
>>>>>>>> 3 07EA001 1971 1971 QMM
>>>>>>>> 4 07EA001 1972 1976 QMC
>>>>>>>> 5 07EA001 1977 1983 QRC
>>>>>>>>
>>>>>>>> I would like to reshape this to one operational record (row) per
>>>>>>>> year
>>>>>>>>
>>>>>>> per
>>>>>>
>>>>>>> station. Something like:
>>>>>>>>
>>>>>>>> 07EA001 1960 QMS
>>>>>>>> 07EA001 1961 QMC
>>>>>>>> 07EA001 1962 QMC
>>>>>>>> 07EA001 1963 QMC
>>>>>>>> ...
>>>>>>>> 07EA001 1971 QMM
>>>>>>>>
>>>>>>>> Can this be done in dplyr easily?
>>>>>>>>
>>>>>>>> Thanks in advance,
>>>>>>>>
>>>>>>>> David
>>>>>>>>
>>>>>>>> [[alternative HTML version deleted]]
>>>>>>>>
>>>>>>>> ______________________________________________
>>>>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>>>> PLEASE do read the posting guide http://www.R-project.org/
>>>>>>>> posting-guide.html
>>>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>>>>
>>>>>>>>
>>>>>>> [[alternative HTML version deleted]]
>>>>>>>
>>>>>>> ______________________________________________
>>>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>>> PLEASE do read the posting guide http://www.R-project.org/
>>>>>>> posting-guide.html
>>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>>>
>>>>>>>
>>>>>> [[alternative HTML version deleted]]
>>>>>>
>>>>>> ______________________________________________
>>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>> PLEASE do read the posting guide http://www.R-project.org/
>>>>>> posting-guide.html
>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>>
>>>>>>
>>>>> [[alternative HTML version deleted]]
>>>>>
>>>>> ______________________________________________
>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide http://www.R-project.org/posti
>>>>> ng-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>
>>>>>
>>>>
>>>> This email message may contain legally privileged
>>>> and/or...{{dropped:2}}
>>>>
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide http://www.R-project.org/posti
>>>> ng-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>
>>> [[alternative HTML version deleted]]
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>
>
> This email message may contain legally privileged and/or confidential
> information. If you are not the intended recipient(s), or the employee
> or agent responsible for the delivery of this message to the intended
> recipient(s), you are hereby notified that any disclosure, copying,
> distribution, or use of this email message is prohibited. If you have
> received this message in error, please notify the sender immediately by
> e-mail and delete this email message from your computer. Thank you.
>
--
Dr. Tóth Dénes ügyvezető
Kogentum Kft.
Tel.: 06-30-2583723
Web: www.kogentum.hu
More information about the R-help
mailing list