[R] R newbie | sapply and FUN error

Fri May 21 00:23:48 CEST 2010

On May 20, 2010, at 4:42 PM, egc wrote:

> Greetings -
> 
> While I've used R a fair bit for basic statistical machinations, I've
> not used it for data manipulation - I've used SAS for 20+ years (and
> SAS really shines in data handling). So, I've started the process of
> trying to figure out 'how to do in R what I can do in my sleep in SAS'
> - specifically wrt data manipulation. So, these are decidedly
> 'newbie' level questions.
> 
> So, starting very simple. Created a tiny example CSV file, which I
> call test.csv.
> 
> Loc,cost
> A,1
> C,3
> D,2
> F,3
> H,4
> K,3
> M,8
> 
> Now, all I want to do is read it in, and derive a new variable which
> is a Z-transform of 'cost'. Here is what I've tried so far:
> 
>> prices <- read.csv("c:/documents and settings/user/desktop/test.csv",header=TRUE,sep=",",na.strings=".");
>>  print(prices$cost);
> 
> So far, so good (being able to pull in the data is a good thing).
> 
> Now, while I'm sure there are lots of ways to do what I want, I'm
> going to brute force it, by calculating column mean and column SD for
> 'cost', generate the Z-transformed value, and then add it to the
> dataframe. However, here is where I'm having problems. After about an
> hour of searching, I realized I need to use an 'apply' function to
> apply a function (say, mean) to a column in a dataframe. But, I can't
> seem to get it to work successfully (and this is the gist of the
> question).
> 
> If I try
> 
>> result <- sapply(prices['cost'],MARGIN=2,FUN=mean,na.rm=TRUE);
>> print(result);
> 
> 
> Works perfectly.
> 
> But, if I simply change FUN=mean to FUN=sd, not so successful:
> 
> If I try
> 
>> result <- sapply(prices['cost'],MARGIN=2,FUN=sd,na.rm=TRUE);
>> print(result);
> 
> Throws the following error:
> 
> Error in FUN(X[[1L]], ...) : unused argument(s) (MARGIN = 2)
> 
> Further, if I try
> 
>> result <- sapply(prices$cost,MARGIN=2,FUN=mean,na.rm=TRUE);
>> print(result);
> 
> it prints 7 values corresponding to the value of each element of the
> data set - meaning, it's treating prices$cost as a row vector. Which
> makes no sense to me. What do I have to do to use prices$cost as the
> first argument in the sapply call? If I can't, why not?
> is.vector(prices$cost) shows TRUE, so why can't I take the mean over
> the vector?
> 
> At any rate, I'll start from here. Being able to apply functions to
> column(s) of a dataframe seems pretty fundamental, so I'd like to
> start by understanding the basics.
> 
> Thanks in advance.

First, welcome to R.

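Second, you are using the argument 'MARGIN', which belongs to the apply() function, not to sapply(). sapply() passes any argument it does not recognize on to FUN. mean() accepts extra arguments and silently ignores MARGIN, whereas sd() does not, hence the 'unused argument' error. In your third call, sapply() loops over each element of the vector prices$cost and returns the mean of each single value, which is why you get the original values back.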
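
The three calls from your message can be reproduced with the sketch below. It rebuilds test.csv inline as a data frame (an assumption made so the example is self-contained), and the names col_mean, sd_msg, per_element and col_sd are made up for illustration.

```r
# Rebuild the poster's test.csv inline so the example is self-contained.
prices <- data.frame(
  Loc  = c("A", "C", "D", "F", "H", "K", "M"),
  cost = c(1, 3, 2, 3, 4, 3, 8)
)

# sapply() has no MARGIN argument; it forwards MARGIN = 2 to FUN via '...'.
# mean() accepts '...', so the stray argument is silently ignored.
col_mean <- sapply(prices["cost"], MARGIN = 2, FUN = mean, na.rm = TRUE)

# sd() has no '...', so the same stray argument raises an error.
# tryCatch() captures the error message instead of stopping the script.
sd_msg <- tryCatch(
  sapply(prices["cost"], MARGIN = 2, FUN = sd, na.rm = TRUE),
  error = function(e) conditionMessage(e)
)

# A bare vector is iterated element by element: one "mean" per value.
per_element <- sapply(prices$cost, FUN = mean, na.rm = TRUE)

# Dropping MARGIN gives the intended column-wise result.
col_sd <- sapply(prices["cost"], FUN = sd, na.rm = TRUE)

print(col_mean)
print(sd_msg)
print(per_element)
print(col_sd)
```

For a single column, calling mean() or sd() directly is simpler; sapply() over a data frame is mainly useful when the same summary is wanted for several columns at once.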

One of the key concepts with R, as opposed to SAS, is that in R, you take a 'holistic' view of objects, not an element-by-element view. So for many operations, R's functions are 'vectorized', which means that they can operate on an entire object (e.g., a column in a data frame) with a single function call.

So in this case:

> mean(prices$cost)
[1] 3.428571

> sd(prices$cost)
[1] 2.225395

gets you what you want. There is also more than one way of accessing the data. For example:

> mean(prices[, "cost"])
[1] 3.428571

> mean(prices[["cost"]])
[1] 3.428571

and

> mean(prices["cost"])
    cost 
3.428571

Note that in the last example, the result is 'named', because prices["cost"] is a one-column data frame rather than a vector. Each of these has to do with the structure of a data frame, which is covered in the manuals and help files, for example: ?Extract and the 'See Also' links on that page.
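
The difference between these forms can be checked with class(). This is a minimal sketch that rebuilds the example data inline; the names v1, v2, v3, d1 and named_mean are made up for illustration. It uses colMeans() for the data frame case because, as far as I know, R versions from 3.0.0 on no longer accept a data frame in mean().

```r
# Rebuild the poster's test.csv inline so the example is self-contained.
prices <- data.frame(
  Loc  = c("A", "C", "D", "F", "H", "K", "M"),
  cost = c(1, 3, 2, 3, 4, 3, 8)
)

# These three forms all return the column as a plain numeric vector.
v1 <- prices$cost
v2 <- prices[, "cost"]
v3 <- prices[["cost"]]

# Single-bracket indexing by name keeps the data frame wrapper.
d1 <- prices["cost"]

print(class(v1))  # "numeric"
print(class(d1))  # "data.frame"

# colMeans() works on the one-column data frame and keeps the column name.
named_mean <- colMeans(d1)
print(named_mean)
```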

There is no need to loop over each element in the column using one of the *apply() functions.
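
The Z-transform you were after follows the same pattern. This is a sketch rather than the only way to do it: the column names z and z_scale are made up for illustration, and the data frame is rebuilt inline instead of being read from test.csv.

```r
# Rebuild the poster's test.csv inline so the example is self-contained.
prices <- data.frame(
  Loc  = c("A", "C", "D", "F", "H", "K", "M"),
  cost = c(1, 3, 2, 3, 4, 3, 8)
)

# Vectorized arithmetic: the subtraction and division apply to the whole
# column at once, so no loop or *apply() call is needed.
prices$z <- (prices$cost - mean(prices$cost, na.rm = TRUE)) /
  sd(prices$cost, na.rm = TRUE)

# scale() centers and scales in one step; it returns a one-column matrix,
# so as.numeric() converts it back to a plain vector.
prices$z_scale <- as.numeric(scale(prices$cost))

print(prices)
```

Both columns should agree; scale() is convenient when several columns need standardizing at once.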

If you have not already done so, I would recommend reading An Introduction to R, which is available via the main R web site in the Manuals section and is also installed with R on your computer.

Additionally, an excellent resource for folks coming from SAS to R is available at:

  http://RforSASandSPSSusers.com/

The authors have provided a terrific review of how to perform, in R, the common operations that you are already comfortable doing in SAS.

HTH,

Marc Schwartz