[R] dplyr - add/expand rows
Martin Morgan
martin.morgan at roswellpark.org
Wed Nov 29 23:58:46 CET 2017
On 11/29/2017 05:47 PM, Tóth Dénes wrote:
> Hi Martin,
>
> On 11/29/2017 10:46 PM, Martin Morgan wrote:
>> On 11/29/2017 04:15 PM, Tóth Dénes wrote:
>>> Hi,
>>>
>>> A benchmarking study with an additional (data.table-based) solution.
>>
>> I don't think speed is the right benchmark (I do agree that
>> correctness is!).
>
> Well, agree, and sorry for the wording. It was really just an exercise
> and not a full evaluation of the approaches. When I read the avalanche
> of solutions neither of which mentioning data.table (my first choice for
> data.frame-manipulations), I became curious how a one-liner data.table
> code performs against the other solutions in terms of speed and
> readability.
> Second, I quite often have the feeling that dplyr is extremely overused
> among novice (and sometimes even experienced) R users nowadays. This is
> unfortunate, as the present example also illustrates.
Another solution is Bill's approach and dplyr's implementation (adding
the 1L to keep integers integers!)
fun_bill1 <- function(d) {
i <- rep(seq_len(nrow(d)), d$to - d$from + 1L)
j <- sequence(d$to - d$from + 1L)
## d[i,] %>% mutate(year = from + j - 1L, from = NULL, to = NULL)
mutate(d[i,], year = from + j - 1L, from = NULL, to = NULL)
}
which is competitive with IRanges and data.table (the more dplyr-ish?
solution
d[i, ] %>% mutate(year = from + j - 1L) %>%
select(station, record, year))
has intermediate performance) and might appeal to those introduced to R
through dplyr but wanting more base R knowledge, and vice versa. I think
if dplyr introduces new users to R, or exposes R users to new approaches
for working with data, that's great!
Martin
>
> Regards,
> Denes
>
>>
>> For the R-help list, maybe something about least specialized R
>> knowledge required would be appropriate? I'd say there were some
>> 'hard' solutions -- Michael (deep understanding of Bioconductor and
>> IRanges), Toth (deep understanding of data.table), Jim (at least for
>> me moderate understanding of dplyr,especially the .$ notation; a
>> simpler dplyr answer might have moved this response out of the
>> 'difficult' category, especially given the familiarity of the OP with
>> dplyr). I'd vote for Bill's as requiring the least specialized
>> knowledge of R (though the +/- 1 indexing is an easy thing to get wrong).
>>
>> A different criteria might be reuse across analysis scenarios. Bill
>> seems to win here again, since the principles are very general and at
>> least moderately efficient (both Bert and Martin's solutions are
>> essentially R-level iterations and have poor scalability, as
>> demonstrated in the microbenchmarks; Bill's is mostly vectorized).
>> Certainly data.table, dplyr, and IRanges are extremely useful within
>> the confines of the problem domains they address.
>>
>> Martin
>>
>>> Enjoy! ;)
>>>
>>> Cheers,
>>> Denes
>>>
>>>
>>> --------------------------
>>>
>>>
>>> ## packages ##########################
>>>
>>> library(dplyr)
>>> library(data.table)
>>> library(IRanges)
>>> library(microbenchmark)
>>>
>>> ## prepare example dataset ###########
>>>
>>> ## use Bert's example, with 2000 stations instead of 2
>>> d_df <- data.frame( station = rep(rep(c("one","two"),c(5,4)), 1000L),
>>> from = as.integer(c(60,61,71,72,76,60,65,82,83)),
>>> to = as.integer(c(60,70,71,76,83,64, 81, 82,83)),
>>> record = c("A","B","C","B","D","B","B","D","E"),
>>> stringsAsFactors = FALSE)
>>> stations <- rle(d_df$station)
>>> stations$value <- gsub(
>>> " ", "0",
>>> paste0("station", format(1:length(stations$value), width = 6)))
>>> d_df$station <- rep(stations$value, stations$lengths)
>>>
>>> ## prepare tibble and data.table versions
>>> d_tbl <- as_tibble(d_df)
>>> d_dt <- as.data.table(d_df)
>>>
>>> ## solutions ##########################
>>>
>>> ## Bert - by
>>> fun_bert <- function(d) {
>>> out <- by(
>>> d, d$station, function(x) with(x, {
>>> i <- to - from +1
>>> data.frame(record =rep(record,i),
>>> year =sequence(i) -1 + rep(from,i),
>>> stringsAsFactors = FALSE)
>>> }))
>>> data.frame(station = rep(names(out), sapply(out,nrow)),
>>> do.call(rbind,out),
>>> row.names = NULL,
>>> stringsAsFactors = FALSE)
>>> }
>>>
>>> ## Bill - transform
>>> fun_bill <- function(d) {
>>> i <- rep(seq_len(nrow(d)), d$to-d$from+1)
>>> j <- sequence(d$to-d$from+1)
>>> transform(d[i,], year=from+j-1, from=NULL, to=NULL)
>>> }
>>>
>>> ## Michael - IRanges
>>> fun_michael <- function(d) {
>>> df <- with(d, DataFrame(station, record, year=IRanges(from, to)))
>>> expand(df, "year")
>>> }
>>>
>>> ## Jim - dplyr
>>> fun_jim <- function(d) {
>>> d %>%
>>> rowwise() %>%
>>> do(tibble(station = .$station,
>>> record = .$record,
>>> year = seq(.$from, .$to))
>>> )
>>> }
>>>
>>> ## Martin - Map
>>> fun_martin <- function(d) {
>>> d$year <- with(d, Map(seq, from, to))
>>> res0 <- with(d, Map(data.frame,
>>> station=station,
>>> record=record,
>>> year=year,
>>> MoreArgs = list(stringsAsFactors = FALSE)))
>>> do.call(rbind, unname(res0))
>>> }
>>>
>>> ## Denes - simple data.table
>>> fun_denes <- function(d) {
>>> out <- d[, .(year = from:to), by = .(station, from, record)]
>>> out[, from := NULL]
>>> }
>>>
>>> ## Check equality ################################
>>> all.equal(fun_bill(d_df), fun_bert(d_df),
>>> check.attributes = FALSE)
>>> all.equal(fun_bill(d_df), fun_martin(d_df),
>>> check.attributes = FALSE)
>>> all.equal(fun_bill(d_df), as.data.frame(fun_michael(d_df)),
>>> check.attributes = FALSE)
>>> all.equal(fun_bill(d_df), as.data.frame(fun_denes(d_dt)),
>>> check.attributes = FALSE)
>>> # Be prepared: this solution is super slow
>>> all.equal(fun_bill(d_df), as.data.frame(fun_jim(d_tbl)),
>>> check.attributes = FALSE)
>>>
>>> ## Benchmark #####################################
>>>
>>> ## Martin
>>> print(system.time(fun_martin(d_df)))
>>>
>>> ## Bert
>>> print(system.time(fun_bert(d_df)))
>>>
>>> ## Top 3
>>> print(
>>> microbenchmark(
>>> fun_bill(d_df),
>>> fun_michael(d_df),
>>> fun_denes(d_dt),
>>> times = 100L
>>> )
>>> )
>>>
>>>
>>> -------------------------
>>>
>>> On 11/28/2017 06:49 PM, Michael Lawrence wrote:
>>>> Or with the Bioconductor IRanges package:
>>>>
>>>> df <- with(input, DataFrame(station, year=IRanges(from, to), record))
>>>> expand(df, "year")
>>>>
>>>> DataFrame with 24 rows and 3 columns
>>>> station year record
>>>> <character> <integer> <character>
>>>> 1 07EA001 1960 QMS
>>>> 2 07EA001 1961 QMC
>>>> 3 07EA001 1962 QMC
>>>> 4 07EA001 1963 QMC
>>>> 5 07EA001 1964 QMC
>>>> ... ... ... ...
>>>> 20 07EA001 1979 QRC
>>>> 21 07EA001 1980 QRC
>>>> 22 07EA001 1981 QRC
>>>> 23 07EA001 1982 QRC
>>>> 24 07EA001 1983 QRC
>>>>
>>>> If you tell the computer more about your data, it can do more things
>>>> for
>>>> you.
>>>>
>>>> Michael
>>>>
>>>> On Tue, Nov 28, 2017 at 7:34 AM, Martin Morgan <
>>>> martin.morgan at roswellpark.org> wrote:
>>>>
>>>>> On 11/26/2017 08:42 PM, jim holtman wrote:
>>>>>
>>>>>> try this:
>>>>>>
>>>>>> ##########################################
>>>>>>
>>>>>> library(dplyr)
>>>>>>
>>>>>> input <- tribble(
>>>>>> ~station, ~from, ~to, ~record,
>>>>>> "07EA001" , 1960 , 1960 , "QMS",
>>>>>> "07EA001" , 1961 , 1970 , "QMC",
>>>>>> "07EA001" , 1971 , 1971 , "QMM",
>>>>>> "07EA001" , 1972 , 1976 , "QMC",
>>>>>> "07EA001" , 1977 , 1983 , "QRC"
>>>>>> )
>>>>>>
>>>>>> result <- input %>%
>>>>>> rowwise() %>%
>>>>>> do(tibble(station = .$station,
>>>>>> year = seq(.$from, .$to),
>>>>>> record = .$record)
>>>>>> )
>>>>>>
>>>>>> ###########################
>>>>>>
>>>>>
>>>>> In a bit more 'base R' mode I did
>>>>>
>>>>> input$year <- with(input, Map(seq, from, to))
>>>>> res0 <- with(input, Map(data.frame, station=station, year=year,
>>>>> record=record))
>>>>> as_tibble(do.call(rbind, unname(res0)))# A tibble: 24 x 3
>>>>>
>>>>> resulting in
>>>>>
>>>>>> as_tibble(do.call(rbind, unname(res0)))# A tibble: 24 x 3
>>>>> station year record
>>>>> <fctr> <int> <fctr>
>>>>> 1 07EA001 1960 QMS
>>>>> 2 07EA001 1961 QMC
>>>>> 3 07EA001 1962 QMC
>>>>> 4 07EA001 1963 QMC
>>>>> 5 07EA001 1964 QMC
>>>>> 6 07EA001 1965 QMC
>>>>> 7 07EA001 1966 QMC
>>>>> 8 07EA001 1967 QMC
>>>>> 9 07EA001 1968 QMC
>>>>> 10 07EA001 1969 QMC
>>>>> # ... with 14 more rows
>>>>>
>>>>> I though I should have been able to use `tibble` in the second
>>>>> step, but
>>>>> that leads to a (cryptic) error
>>>>>
>>>>>> res0 <- with(input, Map(tibble, station=station, year=year,
>>>>> record=record))Error in captureDots(strict = `__quosured`) :
>>>>> the argument has already been evaluated
>>>>>
>>>>> The 'station' and 'record' columns are factors, so different from the
>>>>> original input, but this seems the appropriate data type for theses
>>>>> columns.
>>>>>
>>>>> It's interesting to compare the 'specialized' knowledge needed for
>>>>> each
>>>>> approach -- rowwise(), do(), .$ for tidyverse, with(), do.call(),
>>>>> maybe
>>>>> rbind() and Map() for base R.
>>>>>
>>>>> Martin
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>> Jim Holtman
>>>>>> Data Munger Guru
>>>>>>
>>>>>> What is the problem that you are trying to solve?
>>>>>> Tell me what you want to do, not how you want to do it.
>>>>>>
>>>>>> On Sun, Nov 26, 2017 at 2:10 PM, Bert Gunter <bgunter.4567 at gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> To David W.'s point about lack of a suitable reprex ("reproducible
>>>>>>> example"), Bill's solution seems to be for only one station.
>>>>>>>
>>>>>>> Here is a reprex and modification that I think does what was
>>>>>>> requested
>>>>>>> for
>>>>>>> multiple stations, again using base R and data frames, not dplyr and
>>>>>>> tibbles.
>>>>>>>
>>>>>>> First the reprex with **two** stations:
>>>>>>>
>>>>>>> d <- data.frame( station = rep(c("one","two"),c(5,4)),
>>>>>>>>
>>>>>>> from = c(60,61,71,72,76,60,65,82,83),
>>>>>>> to = c(60,70,71,76,83,64, 81, 82,83),
>>>>>>> record = c("A","B","C","B","D","B","B","D","E"))
>>>>>>>
>>>>>>> d
>>>>>>>>
>>>>>>> station from to record
>>>>>>> 1 one 60 60 A
>>>>>>> 2 one 61 70 B
>>>>>>> 3 one 71 71 C
>>>>>>> 4 one 72 76 B
>>>>>>> 5 one 76 83 D
>>>>>>> 6 two 60 64 B
>>>>>>> 7 two 65 81 B
>>>>>>> 8 two 82 82 D
>>>>>>> 9 two 83 83 E
>>>>>>>
>>>>>>> ## Now the conversion code using base R, especially by():
>>>>>>>
>>>>>>> out <- by(d, d$station, function(x) with(x, {
>>>>>>>>
>>>>>>> + i <- to - from +1
>>>>>>> + data.frame(YEAR =sequence(i) -1 +rep(from,i), RECORD
>>>>>>> =rep(record,i))
>>>>>>> + }))
>>>>>>>
>>>>>>>
>>>>>>> out <- data.frame(station =
>>>>>>>>
>>>>>>> rep(names(out),sapply(out,nrow)),do.call(rbind,out), row.names =
>>>>>>> NULL)
>>>>>>>
>>>>>>>
>>>>>>> out
>>>>>>>>
>>>>>>> station YEAR RECORD
>>>>>>> 1 one 60 A
>>>>>>> 2 one 61 B
>>>>>>> 3 one 62 B
>>>>>>> 4 one 63 B
>>>>>>> 5 one 64 B
>>>>>>> 6 one 65 B
>>>>>>> 7 one 66 B
>>>>>>> 8 one 67 B
>>>>>>> 9 one 68 B
>>>>>>> 10 one 69 B
>>>>>>> 11 one 70 B
>>>>>>> 12 one 71 C
>>>>>>> 13 one 72 B
>>>>>>> 14 one 73 B
>>>>>>> 15 one 74 B
>>>>>>> 16 one 75 B
>>>>>>> 17 one 76 B
>>>>>>> 18 one 76 D
>>>>>>> 19 one 77 D
>>>>>>> 20 one 78 D
>>>>>>> 21 one 79 D
>>>>>>> 22 one 80 D
>>>>>>> 23 one 81 D
>>>>>>> 24 one 82 D
>>>>>>> 25 one 83 D
>>>>>>> 26 two 60 B
>>>>>>> 27 two 61 B
>>>>>>> 28 two 62 B
>>>>>>> 29 two 63 B
>>>>>>> 30 two 64 B
>>>>>>> 31 two 65 B
>>>>>>> 32 two 66 B
>>>>>>> 33 two 67 B
>>>>>>> 34 two 68 B
>>>>>>> 35 two 69 B
>>>>>>> 36 two 70 B
>>>>>>> 37 two 71 B
>>>>>>> 38 two 72 B
>>>>>>> 39 two 73 B
>>>>>>> 40 two 74 B
>>>>>>> 41 two 75 B
>>>>>>> 42 two 76 B
>>>>>>> 43 two 77 B
>>>>>>> 44 two 78 B
>>>>>>> 45 two 79 B
>>>>>>> 46 two 80 B
>>>>>>> 47 two 81 B
>>>>>>> 48 two 82 D
>>>>>>> 49 two 83 E
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Bert
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Bert Gunter
>>>>>>>
>>>>>>> "The trouble with having an open mind is that people keep coming
>>>>>>> along
>>>>>>> and
>>>>>>> sticking things into it."
>>>>>>> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>>>>>>>
>>>>>>> On Sat, Nov 25, 2017 at 4:49 PM, William Dunlap via R-help <
>>>>>>> r-help at r-project.org> wrote:
>>>>>>>
>>>>>>> dplyr may have something for this, but in base R I think the
>>>>>>> following
>>>>>>>>
>>>>>>> does
>>>>>>>
>>>>>>>> what you want. I've shortened the name of your data set to 'd'.
>>>>>>>>
>>>>>>>> i <- rep(seq_len(nrow(d)), d$YEAR_TO-d$YEAR_FROM+1)
>>>>>>>> j <- sequence(d$YEAR_TO-d$YEAR_FROM+1)
>>>>>>>> transform(d[i,], YEAR=YEAR_FROM+j-1, YEAR_FROM=NULL, YEAR_TO=NULL)
>>>>>>>>
>>>>>>>>
>>>>>>>> Bill Dunlap
>>>>>>>> TIBCO Software
>>>>>>>> wdunlap tibco.com
>>>>>>>>
>>>>>>>> On Sat, Nov 25, 2017 at 11:18 AM, Hutchinson, David (EC) <
>>>>>>>> david.hutchinson at canada.ca> wrote:
>>>>>>>>
>>>>>>>> I have a returned tibble of station operational record similar
>>>>>>>> to the
>>>>>>>>> following:
>>>>>>>>>
>>>>>>>>> data.collection
>>>>>>>>>>
>>>>>>>>> # A tibble: 5 x 4
>>>>>>>>> STATION_NUMBER YEAR_FROM YEAR_TO RECORD
>>>>>>>>> <chr> <int> <int> <chr>
>>>>>>>>> 1 07EA001 1960 1960 QMS
>>>>>>>>> 2 07EA001 1961 1970 QMC
>>>>>>>>> 3 07EA001 1971 1971 QMM
>>>>>>>>> 4 07EA001 1972 1976 QMC
>>>>>>>>> 5 07EA001 1977 1983 QRC
>>>>>>>>>
>>>>>>>>> I would like to reshape this to one operational record (row)
>>>>>>>>> per year
>>>>>>>>>
>>>>>>>> per
>>>>>>>
>>>>>>>> station. Something like:
>>>>>>>>>
>>>>>>>>> 07EA001 1960 QMS
>>>>>>>>> 07EA001 1961 QMC
>>>>>>>>> 07EA001 1962 QMC
>>>>>>>>> 07EA001 1963 QMC
>>>>>>>>> ...
>>>>>>>>> 07EA001 1971 QMM
>>>>>>>>>
>>>>>>>>> Can this be done in dplyr easily?
>>>>>>>>>
>>>>>>>>> Thanks in advance,
>>>>>>>>>
>>>>>>>>> David
>>>>>>>>>
>>>>>>>>> [[alternative HTML version deleted]]
>>>>>>>>>
>>>>>>>>> ______________________________________________
>>>>>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>>>>> PLEASE do read the posting guide http://www.R-project.org/
>>>>>>>>> posting-guide.html
>>>>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>>>>>
>>>>>>>>>
>>>>>>>> [[alternative HTML version deleted]]
>>>>>>>>
>>>>>>>> ______________________________________________
>>>>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>>>> PLEASE do read the posting guide http://www.R-project.org/
>>>>>>>> posting-guide.html
>>>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>>>>
>>>>>>>>
>>>>>>> [[alternative HTML version deleted]]
>>>>>>>
>>>>>>> ______________________________________________
>>>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>>> PLEASE do read the posting guide http://www.R-project.org/
>>>>>>> posting-guide.html
>>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>>>
>>>>>>>
>>>>>> [[alternative HTML version deleted]]
>>>>>>
>>>>>> ______________________________________________
>>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>> PLEASE do read the posting guide http://www.R-project.org/posti
>>>>>> ng-guide.html
>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>>
>>>>>>
>>>>>
>>>>> This email message may contain legally privileged
>>>>> and/or...{{dropped:2}}
>>>>>
>>>>>
>>>>> ______________________________________________
>>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide http://www.R-project.org/posti
>>>>> ng-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>
>>>>
>>>> [[alternative HTML version deleted]]
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>
>>
>>
>> This email message may contain legally privileged and/or confidential
>> information. If you are not the intended recipient(s), or the
>> employee or agent responsible for the delivery of this message to the
>> intended recipient(s), you are hereby notified that any disclosure,
>> copying, distribution, or use of this email message is prohibited. If
>> you have received this message in error, please notify the sender
>> immediately by e-mail and delete this email message from your
>> computer. Thank you.
>>
>
This email message may contain legally privileged and/or...{{dropped:2}}
More information about the R-help
mailing list