reshape {stats} | R Documentation |
Reshape Grouped Data
Description
This function reshapes a data frame between ‘wide’ format (with repeated measurements in separate columns of the same row) and ‘long’ format (with the repeated measurements in separate rows).
Usage
reshape(data, varying = NULL, v.names = NULL, timevar = "time",
        idvar = "id", ids = 1:NROW(data),
        times = seq_along(varying[[1]]),
        drop = NULL, direction, new.row.names = NULL,
        sep = ".",
        split = if (sep == "") {
            list(regexp = "[A-Za-z][0-9]", include = TRUE)
        } else {
            list(regexp = sep, include = FALSE, fixed = TRUE)}
        )
### Typical usage for converting from long to wide format:
# reshape(data, direction = "wide",
#         idvar = "___", timevar = "___", # mandatory
#         v.names = c(___),    # time-varying variables
#         varying = list(___)) # auto-generated if missing
### Typical usage for converting from wide to long format:
### If names of wide-format variables are in a 'nice' format
# reshape(data, direction = "long",
#         varying = c(___), # vector
#         sep)              # to help guess 'v.names' and 'times'
### To specify long-format variable names explicitly
# reshape(data, direction = "long",
#         varying = ___,  # list / matrix / vector (use with care)
#         v.names = ___,  # vector of variable names in long format
#         timevar, times, # name / values of constructed time variable
#         idvar, ids)     # name / values of constructed id variable
Arguments
data |
a data frame |
varying |
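names of sets of variables in the wide format that correspond to single variables in long format (‘time-varying’). This is canonically a list of vectors of variable names, but it can optionally be a matrix of names, or a single vector of names. In each case, when direction = "long", the names can be replaced by indices which are interpreted as referring to names(data). See ‘Details’ for more details and options. |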
v.names |
names of variables in the long format that correspond to multiple variables in the wide format. See ‘Details’. |
timevar |
the variable in long format that differentiates multiple records from the same group or individual. If more than one record matches, the first will be taken (with a warning). |
idvar |
Names of one or more variables in long format that identify multiple records from the same group/individual. These variables may also be present in wide format. |
ids |
the values to use for a newly created idvar variable in long format. |
times |
the values to use for a newly created timevar variable in long format. See ‘Details’. |
drop |
a vector of names of variables to drop before reshaping. |
direction |
character string, partially matched to either "wide" to reshape to wide format, or "long" to reshape to long format. |
new.row.names |
character or NULL: a non-null value will be used for the row names of the result. |
sep |
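A character vector of length 1, indicating a separating character in the variable names in the wide format. This is used for guessing v.names and times arguments based on the names in varying. If sep == "", the split is just before the first numeral that follows an alphabetic character. This is also used to create variable names when reshaping to wide format. |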
split |
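A list with three components, regexp, include, and (optionally) fixed. This allows an extended interface to variable name splitting. See ‘Details’. |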
Details
Although reshape() can be used in a variety of contexts, the motivating application is data from longitudinal studies, and the arguments of this function are named and described in those terms. A longitudinal study is characterized by repeated measurements of the same variable(s), e.g., height and weight, on each unit being studied (e.g., individual persons) at different time points (which are assumed to be the same for all units). These variables are called time-varying variables. The study may include other variables that are measured only once for each unit and do not vary with time (e.g., gender and race); these are called time-constant variables.
A ‘wide’ format representation of a longitudinal dataset will have one record (row) for each unit, typically with some time-constant variables that occupy single columns, and some time-varying variables that occupy multiple columns (one column for each time point). A ‘long’ format representation of the same dataset will have multiple records (rows) for each individual, with the time-constant variables being constant across these records and the time-varying variables varying across the records. The ‘long’ format dataset will have two additional variables: a ‘time’ variable identifying which time point each record comes from, and an ‘id’ variable showing which records refer to the same unit.
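As a minimal sketch of the two layouts, the following builds the same toy dataset by hand in both formats. The data frame and its column names (id, sex, height.1, height.2) are hypothetical and chosen only for illustration.

```r
# Hypothetical toy data: two subjects, height measured at two time points.

# Wide format: one row per unit; the time-varying variable occupies
# one column per time point.
wide <- data.frame(id       = 1:2,
                   sex      = c("F", "M"),    # time-constant
                   height.1 = c(150, 160),    # time-varying, time point 1
                   height.2 = c(152, 163))    # time-varying, time point 2

# Long format: one row per unit per time point, with an extra 'time'
# variable; the 'id' variable ties rows from the same unit together.
long <- data.frame(id     = c(1, 2, 1, 2),
                   sex    = c("F", "M", "F", "M"),
                   time   = c(1, 1, 2, 2),
                   height = c(150, 160, 152, 163))

nrow(wide)  # number of units
nrow(long)  # number of units times number of time points
```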
The type of conversion (long to wide or wide to long) is determined by the direction argument, which is mandatory unless the data argument is the result of a previous call to reshape. In that case, the operation can be reversed simply using reshape(data) (the other arguments are stored as attributes on the data frame).
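The round trip described above can be sketched as follows. The toy data frame is hypothetical, and the attribute name "reshapeLong" is an implementation detail of current R versions rather than something stated in this page, so treat the attr() check as an assumption.

```r
# Hypothetical wide data: one time-varying variable 'x' at two time points.
wide <- data.frame(id = 1:3, x.1 = c(5, 6, 7), x.2 = c(8, 9, 10))

# Wide to long: 'direction' is required on the first call.
long <- reshape(wide, direction = "long",
                varying = c("x.1", "x.2"), idvar = "id")

# The information needed to undo the operation travels with the result
# (assumption: it is stored under the attribute name "reshapeLong").
has_undo_info <- !is.null(attr(long, "reshapeLong"))

# Reverse the operation without specifying 'direction' or anything else.
back <- reshape(long)
names(back)
```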
Conversion from long to wide format with direction = "wide" is the simpler operation, and is mainly useful in the context of multivariate analysis where data is often expected as a wide-format matrix. In this case, the time variable timevar and id variable idvar must be specified. All other variables are assumed to be time-varying, unless the time-varying variables are explicitly specified via the v.names argument. A warning is issued if time-constant variables are not actually constant.
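A small sketch of the long-to-wide case, using hypothetical data in which sex is time-constant and weight is time-varying. The second call deliberately makes sex vary within a subject so that the warning mentioned above is raised and captured.

```r
# Hypothetical long data: two subjects, two time points each.
long <- data.frame(id     = c(1, 1, 2, 2),
                   time   = c(1, 2, 1, 2),
                   sex    = c("F", "F", "M", "M"),  # time-constant
                   weight = c(60, 61, 80, 79))      # time-varying

# Only 'weight' is declared time-varying, so 'sex' stays a single column.
wide <- reshape(long, direction = "wide", idvar = "id", timevar = "time",
                v.names = "weight")
names(wide)

# If a supposedly time-constant variable actually varies within a unit,
# reshape() warns; here the warning message is captured as a string.
long_bad <- transform(long, sex = c("F", "M", "M", "M"))
msg <- tryCatch(
  reshape(long_bad, direction = "wide", idvar = "id", timevar = "time",
          v.names = "weight"),
  warning = function(w) conditionMessage(w)
)
```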
Each time-varying variable is expanded into multiple variables in the wide format. The names of these expanded variables are generated automatically, unless they are specified as the varying argument in the form of a list (or matrix) with one component (or row) for each time-varying variable. If varying is a vector of names, it is implicitly converted into a matrix, with one row for each time-varying variable. Use this option with care if there are multiple time-varying variables, as the ordering (by column, the default in the matrix constructor) may be unintuitive, whereas the explicit list or matrix form is unambiguous.
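To illustrate the explicit list form of varying when reshaping to wide format, here is a sketch with two time-varying variables. The data and the output column names (x_first, x_second, and so on) are hypothetical.

```r
# Hypothetical long data with two time-varying variables, x and y.
long <- data.frame(id   = c(1, 1, 2, 2),
                   time = c(1, 2, 1, 2),
                   x    = c(5, 6, 7, 8),
                   y    = c(0.1, 0.2, 0.3, 0.4))

# One list component per time-varying variable, in the order of 'v.names';
# within each component, one name per time point.
wide <- reshape(long, direction = "wide", idvar = "id", timevar = "time",
                v.names = c("x", "y"),
                varying = list(c("x_first", "x_second"),
                               c("y_first", "y_second")))
wide
```

The list form makes it explicit which wide-format column receives which variable at which time point, so nothing depends on how a plain vector would be laid out as a matrix.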
Conversion from wide to long with direction = "long" is the more common operation as most (univariate) statistical modeling functions expect data in the long format. In the simpler case where there is only one time-varying variable, the corresponding columns in the wide format input can be specified as the varying argument, which can be either a vector of column names or the corresponding column indices. The name of the corresponding variable in the long format output combining these columns can be optionally specified as the v.names argument, and the name of the time variable as the timevar argument. The values to use as the time values corresponding to the different columns in the wide format can be specified as the times argument. If v.names is unspecified, the function will attempt to guess v.names and times from varying (an explicitly specified times argument is unused in that case). The default expects variable names like x.1, x.2, where sep = "." specifies to split at the dot and drop it from the name. To have alphabetic followed by numeric times use sep = "".
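The three variants described in this paragraph (guessing with the default sep, explicit v.names / timevar / times, and guessing with sep = "") can be sketched as below. All data frames and the names value and visit are hypothetical.

```r
# Hypothetical wide data with names in the default 'x.1', 'x.2' style.
wide <- data.frame(id = 1:2, x.1 = c(5, 7), x.2 = c(6, 8))

# 1. Let reshape() guess 'v.names' ("x") and 'times' (1, 2) from the names.
long_guess <- reshape(wide, direction = "long",
                      varying = c("x.1", "x.2"), idvar = "id")

# 2. Name the long-format variable and the time variable explicitly and
#    supply the time values; no guessing is attempted here.
long_explicit <- reshape(wide, direction = "long",
                         varying = c("x.1", "x.2"),
                         v.names = "value", timevar = "visit",
                         times = c(0, 6), idvar = "id")

# 3. Names of the form 'dose1', 'dose2' (letters followed by a number)
#    can be split by guessing with sep = "".
wide2 <- data.frame(id = 1:2, dose1 = c(1, 2), dose2 = c(2, 1))
long_nosep <- reshape(wide2, direction = "long",
                      varying = c("dose1", "dose2"), sep = "", idvar = "id")
```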
Multiple time-varying variables can be specified in two ways, either with varying as an atomic vector as above, or as a list (or a matrix). The first form is useful (and mandatory) if the automatic variable name splitting as described above is used; this requires the names of all time-varying variables to be suitably formatted in the same manner, and v.names to be unspecified. If varying is a list (with one component for each time-varying variable) or a matrix (one row for each time-varying variable), variable name splitting is not attempted, and v.names and times will generally need to be specified, although they will default to, respectively, the first variable name in each set, and sequential times.
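A sketch of the list form with two time-varying variables whose wide-format names do not follow a splittable pattern. The data and all names are hypothetical; the second call shows the defaults that apply when v.names and times are left out.

```r
# Hypothetical wide data: height and weight at a baseline and a follow-up visit.
wide <- data.frame(id        = 1:2,
                   ht_base   = c(150, 160), ht_follow = c(152, 163),
                   wt_base   = c(50, 60),   wt_follow = c(51, 62))

sets <- list(c("ht_base", "ht_follow"),   # columns forming 'height'
             c("wt_base", "wt_follow"))   # columns forming 'weight'

# Explicit long-format names and time values; no name splitting is attempted.
long <- reshape(wide, direction = "long", idvar = "id",
                varying = sets,
                v.names = c("height", "weight"),
                timevar = "visit", times = c("base", "follow"))

# With 'v.names' and 'times' omitted, the long-format variables take the
# first name in each set and the times are sequential.
long_default <- reshape(wide, direction = "long", idvar = "id",
                        varying = sets)
names(long_default)
```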
Also, guessing is not attempted if v.names is given explicitly, even if varying is an atomic vector. In that case, the number of time-varying variables is taken to be the length of v.names, and varying is implicitly converted into a matrix, with one row for each time-varying variable. As in the case of long to wide conversion, the matrix is filled up by column, so careful attention needs to be paid to the order of variable names (or indices) in varying, which is taken to be like x.1, y.1, x.2, y.2 (i.e., variables corresponding to the same time point need to be grouped together).
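The ordering requirement can be made concrete with a short sketch. The data are hypothetical, and the two matrix() calls are only an illustration of column-wise filling, not the internal code of reshape().

```r
# Hypothetical wide data with two time-varying variables at two time points.
wide <- data.frame(id = 1:2,
                   x.1 = c(1, 2), y.1 = c(10, 20),
                   x.2 = c(3, 4), y.2 = c(30, 40))

# Column-wise filling with one row per variable: grouping names by time
# point puts all 'x' columns in row 1 and all 'y' columns in row 2.
m_good <- matrix(c("x.1", "y.1", "x.2", "y.2"), nrow = 2)

# Grouping by variable instead mixes x and y within a row.
m_bad <- matrix(c("x.1", "x.2", "y.1", "y.2"), nrow = 2)

# Atomic 'varying' with explicit 'v.names': names grouped by time point.
long <- reshape(wide, direction = "long", idvar = "id",
                varying = c("x.1", "y.1", "x.2", "y.2"),
                v.names = c("x", "y"))
long
```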
The split argument should not usually be necessary. The split$regexp component is passed to either strsplit or regexpr, where the latter is used if split$include is TRUE, in which case the splitting occurs after the first character of the matched string. In the strsplit case, the separator is not included in the result, and it is possible to specify fixed-string matching using split$fixed.
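For the rare case where split is useful, the following sketch uses a regular-expression separator that is dropped from the names (the strsplit case). The column names score_wave1 and score_wave2 and the pattern "_[a-z]+" are hypothetical.

```r
# Hypothetical wide data whose names have a multi-character separator
# ("_wave") between the variable name and the time value.
wide <- data.frame(id = 1:2,
                   score_wave1 = c(10, 20),
                   score_wave2 = c(11, 21))

# include = FALSE: the matched separator is removed, leaving
# "score" as the variable name and "1", "2" as the times.
long <- reshape(wide, direction = "long", idvar = "id",
                varying = c("score_wave1", "score_wave2"),
                split = list(regexp = "_[a-z]+", include = FALSE,
                             fixed = FALSE))
long
```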
Value
The reshaped data frame with added attributes to simplify reshaping back to the original form.
See Also
stack, aperm; relist for reshaping the result of unlist. xtabs and as.data.frame.table for creating contingency tables and converting them back to data frames.
Examples
summary(Indometh) # data in long format
## long to wide (direction = "wide") requires idvar and timevar at a minimum
reshape(Indometh, direction = "wide", idvar = "Subject", timevar = "time")
## can also explicitly specify name of combined variable
wide <- reshape(Indometh, direction = "wide", idvar = "Subject",
                timevar = "time", v.names = "conc", sep = "_")
wide
## reverse transformation
reshape(wide, direction = "long")
reshape(wide, idvar = "Subject", varying = list(2:12),
        v.names = "conc", direction = "long")
## times need not be numeric
df <- data.frame(id = rep(1:4, rep(2,4)),
                 visit = rep(c("Before","After"), 4),
                 x = rnorm(4), y = runif(4))
df
reshape(df, timevar = "visit", idvar = "id", direction = "wide")
## warns that y is really varying
reshape(df, timevar = "visit", idvar = "id", direction = "wide", v.names = "x")
## unbalanced 'long' data leads to NA fill in 'wide' form
df2 <- df[1:7, ]
df2
reshape(df2, timevar = "visit", idvar = "id", direction = "wide")
## Alternative regular expressions for guessing names
df3 <- data.frame(id = 1:4, age = c(40,50,60,50), dose1 = c(1,2,1,2),
                  dose2 = c(2,1,2,1), dose4 = c(3,3,3,3))
reshape(df3, direction = "long", varying = 3:5, sep = "")
## an example that isn't longitudinal data
state.x77 <- as.data.frame(state.x77)
long <- reshape(state.x77, idvar = "state", ids = row.names(state.x77),
                times = names(state.x77), timevar = "Characteristic",
                varying = list(names(state.x77)), direction = "long")
reshape(long, direction = "wide")
reshape(long, direction = "wide", new.row.names = unique(long$state))
## multiple id variables
df3 <- data.frame(school = rep(1:3, each = 4), class = rep(9:10, 6),
                  time = rep(c(1,1,2,2), 3), score = rnorm(12))
wide <- reshape(df3, idvar = c("school", "class"), direction = "wide")
wide
## transform back
reshape(wide)