[R] Retrieving original data frame after repetition
Marc Schwartz
marc_schwartz at me.com
Fri Jul 31 14:45:15 CEST 2009
On Jul 31, 2009, at 5:52 AM, Jose Iparraguirre D'Elia wrote:
> Hi Marc (et al)
>
> I've spoken too soon...
>
> Please, have a look at this chunk of real world data.
>
> The data frame a below contains the first ten records (and first two
> columns) of a survey dataset. It reads as follows: 1662 people have
> an income of 279, etc. If you see lines 2 and 3, there are 1956
> people earning 218 but there are also 489 people earning the same
> amount. The difference between these two groups of people lies in a
> third column, not shown. (We could think of men and women,
> respectively, for example).
>
> a
> income grossing
> 1 279 1662
> 2 218 1956
> 3 218 489
> 4 378 278
> 5 420 278
> 6 200 289
> 7 149 191
> 8 256 1360
> 9 269 1348
> 10 1259 900
>
>
> Now I create a vector of all people, one by one, with their
> respective incomes, by repeating income times grossing:
>
> aa <- rep(a$income, a$grossing)
> length(aa)
> [1] 8751
>
> If I apply Marc's suggestion,
>
> z <- do.call(data.frame, rle(aa))[, c(2, 1)]
> colnames(z) <- c("x", "y")
>
> I obtain
>
> z
> x y
> 1 279 1662
> 2 218 2445
> 3 378 278
> 4 420 278
> 5 200 289
> 6 149 191
> 7 256 1360
> 8 269 1348
> 9 1259 900
>
> That is, lines 2 and 3 in the original data frame have been merged.
>
> How can I retrieve the original data frame a?
>
> Do I need to use that 'missing' third column? And if so, how? I've
> read ?rle but it seems it only applies to vectors.
>
> Any help, once again, greatly appreciated...
>
> Regards,
>
> Jose
Presuming that you goal here is to generate the original raw data from
the summarized table above, to enable descriptive summaries, I have
previously posted a function called expand.dft(), which I post again
below:
expand.dft <- function(x, var.names = NULL, freq = "Freq", ...)
{
# allow: a table object, or a data frame in frequency form
if(inherits(x, "table"))
x <- as.data.frame.table(x, responseName = freq)
freq.col <- which(colnames(x) == freq)
if (length(freq.col) == 0)
stop(paste(sQuote("freq"), "not found in column names"))
DF <- sapply(1:nrow(x),
function(i) x[rep(i, each = x[i, freq.col]), ],
simplify = FALSE)
DF <- do.call("rbind", DF)[, -freq.col]
for (i in 1:ncol(DF))
{
DF[[i]] <- type.convert(as.character(DF[[i]]), ...)
}
rownames(DF) <- NULL
if (!is.null(var.names))
{
if (length(var.names) < dim(DF)[2])
{
stop(paste("Too few", sQuote("var.names"), "given."))
} else if (length(var.names) > dim(DF)[2]) {
stop(paste("Too many", sQuote("var.names"), "given."))
} else {
names(DF) <- var.names
}
}
DF
}
You are going to need a 'third' column of data to differentiate your
table when you have two successive income values that are the same,
since this breaks the rle() based approach. Since you referenced
gender, let's just go ahead and add third column to 'a' above so that
at least here, the two 218 values have unique genders associated with
them. This approach also provides a more generic process.
a3 <- cbind(a, gender = rep(c("Male", "Female"), 5))
> a3
income grossing gender
1 279 1662 Male
2 218 1956 Female
3 218 489 Male
4 378 278 Female
5 420 278 Male
6 200 289 Female
7 149 191 Male
8 256 1360 Female
9 269 1348 Male
10 1259 900 Female
Now expand a3, indicating that the count column to use is 'grossing':
a.expanded <- expand.dft(a3, freq = "grossing")
> str(a.expanded)
'data.frame': 8751 obs. of 2 variables:
$ income: int 279 279 279 279 279 279 279 279 279 279 ...
$ gender: Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
Notice that a.expanded has 8751 rows, which is the sum total of a
$grossing.
Now to begin to reverse the process and return the summarized table.
See ?as.data.frame.table
A <- as.data.frame.table(table(a.expanded), responseName = "grossing")
> A
income gender grossing
1 149 Female 0
2 200 Female 289
3 218 Female 1956
4 256 Female 1360
5 269 Female 0
6 279 Female 0
7 378 Female 278
8 420 Female 0
9 1259 Female 900
10 149 Male 191
11 200 Male 0
12 218 Male 489
13 256 Male 0
14 269 Male 1348
15 279 Male 1662
16 378 Male 0
17 420 Male 278
18 1259 Male 0
A <- subset(A, grossing > 0)
> A
income gender grossing
2 200 Female 289
3 218 Female 1956
4 256 Female 1360
7 378 Female 278
9 1259 Female 900
10 149 Male 191
12 218 Male 489
14 269 Male 1348
15 279 Male 1662
17 420 Male 278
# Note that income is returned as a factor
> str(A)
'data.frame': 10 obs. of 3 variables:
$ income : Factor w/ 9 levels "149","200","218",..: 1 2 3 3 4 5 6 7
8 9
$ grossing: int 191 289 1956 489 1360 1348 1662 278 278 900
$ gender : Factor w/ 2 levels "Female","Male": 2 1 1 2 1 2 2 1 2 1
# Convert 'income' from a factor back to numeric
A$income <- as.numeric(as.character(A$income))
> str(A)
'data.frame': 10 obs. of 3 variables:
$ income : num 149 200 218 218 256 ...
$ grossing: int 191 289 1956 489 1360 1348 1662 278 278 900
$ gender : Factor w/ 2 levels "Female","Male": 2 1 1 2 1 2 2 1 2 1
That gives you back the basic structure of the original 'a3'. Note of
course, that the ordering of the rows is not going to be the same, but
I presume that this is not an issue for you.
HTH,
Marc Schwartz
More information about the R-help
mailing list