[R] Data transformation problem
Jeff Newmiller
jdnewm|| @end|ng |rom dcn@d@v|@@c@@u@
Thu Nov 12 08:20:32 CET 2020
I am not a data.table afficiando, but here is how I would do it with
dplyr/tidyr:
library(dplyr)
library(tidyr)
do_per_REL <- function( DF ) {
rng <- range( DF$REF1 ) # watch out for missing months?
DF <- ( data.frame( REF1 = seq( rng[ 1 ], rng[ 2 ], by = "month" ) )
%>% left_join( DF, by = "REF1" )
%>% arrange( REF1 )
)
with( DF
, data.frame( REF2 = REF1[ -1 ]
, VAL2 = 100 * diff( VAL1 ) / VAL1[ -length( VAL1 ) ]
)
)
}
df2a <- ( df1
%>% mutate( REF1 = as.Date( REF1 )
, REL1 = as.Date( REL1 )
)
%>% nest( data = -REL1 )
%>% rename( REL2 = REL1 )
%>% rowwise()
%>% mutate( data = list( do_per_REL( data ) ) )
%>% ungroup()
%>% unnest( cols = "data" )
%>% select( REF2, REL2, VAL2 )
%>% arrange( REF2, desc( REL2 ), VAL2 )
)
df2a
On Wed, 11 Nov 2020, phil using philipsmith.ca wrote:
> I am stuck on a data transformation problem. I have a data frame, df1 in my
> example, with some original "levels" data. The data pertain to some variable,
> such as GDP, in various reference periods, REF, as estimated and released in
> various release periods, REL. The release periods follow after the reference
> periods by two months or more, sometimes by several years. I want to build a
> second data frame, called df2 in my example, with the month-to-month growth
> rates that existed in each reference period, revealing the revisions to those
> growth rates in subsequent periods.
>
> REF1 <- c("2017-01-01","2017-01-01","2017-01-01","2017-01-01","2017-01-01",
> "2017-02-01","2017-02-01","2017-02-01","2017-02-01","2017-02-01",
> "2017-03-01","2017-03-01","2017-03-01","2017-03-01","2017-03-01")
> REL1 <- c("2020-09-01","2020-08-01","2020-07-01","2020-06-01","2019-05-01",
> "2020-09-01","2020-08-01","2020-07-01","2020-06-01","2019-05-01",
> "2020-09-01","2020-08-01","2020-07-01","2020-06-01","2019-05-01")
> VAL1 <-
> c(17974,14567,13425,NA,12900,17974,14000,14000,12999,13245,17197,11500,
> 19900,18765,13467)
> df1 <- data.frame(REF1,REL1,VAL1)
> REF2 <- c("2017-02-01","2017-02-01","2017-02-01","2017-02-01","2017-02-01",
> "2017-03-01","2017-03-01","2017-03-01","2017-03-01","2017-03-01")
> REL2 <- c("2020-09-01","2020-08-01","2020-07-01","2020-06-01","2019-05-01",
> "2020-09-01","2020-08-01","2020-07-01","2020-06-01","2019-05-01")
> VAL2 <- c(0.0,-3.9,4.3,NA,2.3,-4.3,-17.9,42.1,44.4,1.7)
> df2 <- data.frame(REF2,REL2,VAL2)
>
> In my example I have provided some sample data pertaining to three reference
> months, 2017-01-01 through 2017-03-01, and five release periods,
> "2020-09-01","2020-08-01","2020-07-01","2020-06-01" and "2019-05-01". In my
> actual problem I have millions of REF-REL combinations, so my data frame is
> quite large. I am using data.table for faster processing, though I am more
> familiar with the tidyverse. I am providing df2 as the target data frame for
> my example, so you can see what I am trying to achieve.
>
> I have not been able to find an efficient way to do these calculations. I
> have tried "for" loops with "if" statements, without success so far, and
> anyway this approach would be too slow, I fear. Suggestions as to how I might
> proceed would be much appreciated.
>
> Philip
>
> ______________________________________________
> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
---------------------------------------------------------------------------
Jeff Newmiller The ..... ..... Go Live...
DCN:<jdnewmil using dcn.davis.ca.us> Basics: ##.#. ##.#. Live Go...
Live: OO#.. Dead: OO#.. Playing
Research Engineer (Solar/Batteries O.O#. #.O#. with
/Software/Embedded Controllers) .OO#. .OO#. rocks...1k
More information about the R-help
mailing list