[R] Add a column to a dataframe based on multiple other column values

Wed Jun 12 13:12:18 CEST 2013

Tom,

Here is my solution. Note that I assume the columns are interleaved as you describe below. I'm sure others will have better replies.

Note that posting your data with dput() makes it much easier for others to help, since the output can be pasted straight into R.

# From dput(mdat)
mdat<-structure(list(x1 = c(2L, 2L, 2L, 3L, 3L, 30L, 32L, 33L, 33L), 
    y1 = c(100L, 100L, 100L, 0L, 0L, 0L, 100L, 82L, 0L), x2 = c(190L, 
    192L, 192L, 195L, 198L, 198L, 868L, 870L, 871L), y2 = c(99L, 
    63L, 63L, 99L, 98L, 100L, 100L, 100L, 82L), x3 = c(1430L, 
    1431L, 1444L, 1499L, 1500L, 1451L, 1451L, 1490L, 1494L), 
    y3 = c(79L, 75L, 51L, 50L, 80L, 97L, 97L, 97L, 85L), output = c(89, 
    69, 57, 74.5, 89, 65.66666667, 99, 93, 55.66666667)), .Names = c("x1", 
"y1", "x2", "y2", "x3", "y3", "output"), class = "data.frame", row.names = c(NA, 
-9L))

# Drop the last column (the expected output), leaving only the x/y pairs
mdat.pure <- mdat[, -ncol(mdat)]

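# Function to apply to each row; x is one row as a vector (x1, y1, x2, y2, ...)
theFunk <- function(x) {
  nxy <- length(x) / 2        # number of x/y pairs
  idx <- seq_len(nxy)
  xvec <- x[idx * 2 - 1]      # odd positions hold the x values
  yvec <- x[idx * 2]          # even positions hold the y values
  mean(yvec[xvec > 10])       # mean of the y values whose paired x exceeds 10
}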

# Apply the function to rows
output<-apply(mdat.pure, 1, theFunk)

Or 

mdat.pure$output<-apply(mdat.pure, 1, theFunk)

will put the calculated column at the end of mdat.pure.
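
If the real data frame is large, a vectorized version may be worth trying instead of apply(). What follows is only a minimal sketch: it assumes the columns are named x1, y1, x2, y2, ... with the x and y columns in the same order, and that there are no missing values. The names dat, xmat, ymat, keep and vec_output are my own, made up for illustration, and dat is a small stand-in built from rows 1, 4 and 6 of your data.

```r
# Stand-in data: rows 1, 4 and 6 of the posted data, without the output column.
dat <- data.frame(x1 = c(2L, 3L, 30L),        y1 = c(100L, 0L, 0L),
                  x2 = c(190L, 195L, 198L),   y2 = c(99L, 99L, 100L),
                  x3 = c(1430L, 1499L, 1451L), y3 = c(79L, 50L, 97L))

# Pick the x and y columns by name rather than by position
# (assumes names of the form x<number> and y<number>).
xmat <- as.matrix(dat[, grep("^x[0-9]+$", names(dat))])
ymat <- as.matrix(dat[, grep("^y[0-9]+$", names(dat))])

# Logical mask: TRUE where the paired x value is greater than 10.
keep <- xmat > 10

# Row-wise mean of the kept y values: (sum of kept y) / (number kept).
# A row with no x above 10 gives 0/0 = NaN, as mean() of an empty vector does.
vec_output <- rowSums(ymat * keep) / rowSums(keep)
```

Selecting columns by name also removes the need for the columns to be strictly interleaved, as long as the i-th x column pairs with the i-th y column.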

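Note that I haven't taken account of missing values: as written, an NA in an x column, or in a y column whose x is above 10, makes that row's result NA.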
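
If missing values do occur, one possible way to handle them is sketched below. This is an assumption about what you would want, namely that a pair with a missing x is simply skipped and a missing y is left out of the mean. The function name theFunkNA, its threshold argument, and the small matrix m are made up for illustration.

```r
# Variant of the row function that tolerates missing values.
# (theFunkNA and threshold are illustrative names, not from the original post.)
theFunkNA <- function(x, threshold = 10) {
  idx  <- seq_len(length(x) / 2)
  xvec <- x[idx * 2 - 1]   # odd positions: x values
  yvec <- x[idx * 2]       # even positions: y values
  # which() drops NA comparisons, so a pair with a missing x is skipped;
  # na.rm = TRUE then leaves out any kept y that is itself missing.
  mean(yvec[which(xvec > threshold)], na.rm = TRUE)
}

# Two made-up rows: the first has a missing x1, the second a missing y2.
m <- rbind(c(NA, 100, 190, 99, 1430, 79),
           c(30,   0, 198, NA, 1451, 97))
na_output <- apply(m, 1, theFunkNA)
```

It is used the same way as before, e.g. apply(mdat.pure, 1, theFunkNA).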

Hope this helps,
KW

--

On Jun 12, 2013, at 6:00 AM, r-help-request at r-project.org wrote:

> Message: 35
> Date: Tue, 11 Jun 2013 17:07:12 +0100
> From: Tom Oates <toates19 at gmail.com>
> To: r-help at r-project.org
> Subject: [R] Add a column to a dataframe based on multiple other column values
> 
> Hi
> I have a dataframe as below:
> 
> x1  y1   x2   y2   x3    y3  output
> 2   100  190  99   1430  79  89
> 2   100  192  63   1431  75  69
> 2   100  192  63   1444  51  57
> 3   0    195  99   1499  50  74.5
> 3   0    198  98   1500  80  89
> 30  0    198  100  1451  97  65.66666667
> 32  100  868  100  1451  97  99
> 33  82   870  100  1490  97  93
> 33  0    871  82   1494  85  55.66666667
> 
> 
> In reality the dataframe has a large number of paired x & y columns.
> As you can see from the column labelled output, I want to calculate the
> mean of the yn columns for each row, but to include a yn value in the
> mean only if the corresponding xn value is greater than 10.
> So for row 1 only y2 & y3 are included in calculating the output column,
> but for row 6 y1-y3 are all included.
> Because the number of paired x & y columns is large, I am not sure of the
> best way to achieve this.
> Thanks in advance
> Tom