[R] Retrieving original data frame after repetition

Fri Jul 31 14:45:15 CEST 2009

On Jul 31, 2009, at 5:52 AM, Jose Iparraguirre D'Elia wrote:

> Hi Marc (et al)
>
> I've spoken too soon...
>
> Please, have a look at this chunk of real world data.
>
> The data frame a below contains the first ten records (and first two  
> columns) of a survey dataset. It reads as follows: 1662 people have  
> an income of 279, etc. If you see lines 2 and 3, there are 1956  
> people earning 218 but there are also 489 people earning the same  
> amount. The difference between these two groups of people lies in a  
> third column, not shown. (We could think of men and women,  
> respectively, for example).
>
> a
>   income grossing
> 1     279     1662
> 2     218     1956
> 3     218      489
> 4     378      278
> 5     420      278
> 6     200      289
> 7     149      191
> 8     256     1360
> 9     269     1348
> 10   1259      900
>
>
> Now I create a vector of all people, one by one, with their  
> respective incomes, by repeating income times grossing:
>
> aa <- rep(a$income, a$grossing)
> length(aa)
> [1] 8751
>
> If I apply Marc's suggestion,
>
> z <- do.call(data.frame, rle(aa))[, c(2, 1)]
> colnames(z) <- c("x", "y")
>
> I obtain
>
> z
>     x    y
> 1  279 1662
> 2  218 2445
> 3  378  278
> 4  420  278
> 5  200  289
> 6  149  191
> 7  256 1360
> 8  269 1348
> 9 1259  900
>
> That is, lines 2 and 3 in the original data frame have been merged.
>
> How can I retrieve the original data frame a?
>
> Do I need to use that 'missing' third column? And if so, how? I've  
> read ?rle but it seems it only applies to vectors.
>
> Any help, once again, greatly appreciated...
>
> Regards,
>
> Jose

Presuming that you goal here is to generate the original raw data from  
the summarized table above, to enable descriptive summaries, I have  
previously posted a function called expand.dft(), which I post again  
below:

expand.dft <- function(x, var.names = NULL, freq = "Freq", ...)
{
   #  allow: a table object, or a data frame in frequency form
   if(inherits(x, "table"))
     x <- as.data.frame.table(x, responseName = freq)

   freq.col <- which(colnames(x) == freq)
   if (length(freq.col) == 0)
       stop(paste(sQuote("freq"), "not found in column names"))

   DF <- sapply(1:nrow(x),
                function(i) x[rep(i, each = x[i, freq.col]), ],
                simplify = FALSE)

   DF <- do.call("rbind", DF)[, -freq.col]

   for (i in 1:ncol(DF))
   {
     DF[[i]] <- type.convert(as.character(DF[[i]]), ...)

   }

   rownames(DF) <- NULL

   if (!is.null(var.names))
   {
     if (length(var.names) < dim(DF)[2])
     {
       stop(paste("Too few", sQuote("var.names"), "given."))
     } else if (length(var.names) > dim(DF)[2]) {
       stop(paste("Too many", sQuote("var.names"), "given."))
     } else {
       names(DF) <- var.names
     }
   }

   DF
}

You are going to need a 'third' column of data to differentiate your  
table when you have two successive income values that are the same,  
since this breaks the rle() based approach. Since you referenced  
gender, let's just go ahead and add third column to 'a' above so that  
at least here, the two 218 values have unique genders associated with  
them. This approach also provides a more generic process.

a3 <- cbind(a, gender = rep(c("Male", "Female"), 5))

 > a3
    income grossing gender
1     279     1662   Male
2     218     1956 Female
3     218      489   Male
4     378      278 Female
5     420      278   Male
6     200      289 Female
7     149      191   Male
8     256     1360 Female
9     269     1348   Male
10   1259      900 Female

Now expand a3, indicating that the count column to use is 'grossing':

a.expanded <- expand.dft(a3, freq = "grossing")

 > str(a.expanded)
'data.frame':	8751 obs. of  2 variables:
  $ income: int  279 279 279 279 279 279 279 279 279 279 ...
  $ gender: Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...

Notice that a.expanded has 8751 rows, which is the sum total of a 
$grossing.

Now to begin to reverse the process and return the summarized table.  
See ?as.data.frame.table

A <- as.data.frame.table(table(a.expanded), responseName = "grossing")

 > A
    income gender grossing
1     149 Female        0
2     200 Female      289
3     218 Female     1956
4     256 Female     1360
5     269 Female        0
6     279 Female        0
7     378 Female      278
8     420 Female        0
9    1259 Female      900
10    149   Male      191
11    200   Male        0
12    218   Male      489
13    256   Male        0
14    269   Male     1348
15    279   Male     1662
16    378   Male        0
17    420   Male      278
18   1259   Male        0

A <- subset(A, grossing > 0)

 > A
    income gender grossing
2     200 Female      289
3     218 Female     1956
4     256 Female     1360
7     378 Female      278
9    1259 Female      900
10    149   Male      191
12    218   Male      489
14    269   Male     1348
15    279   Male     1662
17    420   Male      278

# Note that income is returned as a factor

 > str(A)
'data.frame':	10 obs. of  3 variables:
  $ income  : Factor w/ 9 levels "149","200","218",..: 1 2 3 3 4 5 6 7  
8 9
  $ grossing: int  191 289 1956 489 1360 1348 1662 278 278 900
  $ gender  : Factor w/ 2 levels "Female","Male": 2 1 1 2 1 2 2 1 2 1

# Convert 'income' from a factor back to numeric

A$income <- as.numeric(as.character(A$income))

 > str(A)
'data.frame':	10 obs. of  3 variables:
  $ income  : num  149 200 218 218 256 ...
  $ grossing: int  191 289 1956 489 1360 1348 1662 278 278 900
  $ gender  : Factor w/ 2 levels "Female","Male": 2 1 1 2 1 2 2 1 2 1

That gives you back the basic structure of the original 'a3'. Note of  
course, that the ordering of the rows is not going to be the same, but  
I presume that this is not an issue for you.

HTH,

Marc Schwartz