[R] how to ignore rows missing arguments of a function when creating a function?

Mon Jun 14 16:36:15 CEST 2010

Ah, I overlooked that possibility.

You can do following :
....
not <- attr(fm$model,"na.action")
if( ! is.null(not)){   # only drop the NA values if there are any left
out of the model
   cluster <- cluster[-not]
   dat <- dat[-not,]
}
with(dat,{
....

On Mon, Jun 14, 2010 at 4:30 PM, edmund jones <edmund.j.jones at gmail.com> wrote:
> Thanks a lot!
>
> So, what you propose works perfectly (the "cluster" variable is indeed a
> vector), except in the case where I have no missing values in my regression.
> With the following function:
>
> cl2 <- function(dat,fm, cluster){
> attach(dat, warn.conflicts = F)
> require(sandwich)
> require(lmtest)
> not <- attr(fm$model,"na.action")
> cluster <- cluster[-not]
> with( dat[-not,] ,{
> M <- length(unique(cluster))
> N <- length(cluster)
> K <- fm$rank
> dfc <- (M/(M-1))*((N-1)/(N-K))
> uj <- apply(estfun(fm),2, function(x) tapply(x, cluster, sum));
> vcovCL <- dfc*sandwich(fm, meat=crossprod(uj)/N)
> coeftest(fm, vcovCL)
> }
> )
> }
>
> If I have no missing values in the arguments of fm, get the message
> "error in -not: invalid argument to unary operator"
>
> In your example:
>
>
>> x <- rnorm(100)
>> y <- c(rnorm(90),NA,rnorm(9))
>> test <- lm(x~y)
>> str(test)
> List of 13
>  ... (tons of information)
>  $ model        :'data.frame':  99 obs. of  2 variables:
>  ... (more tons of information)
>  ..- attr(*, "na.action")=Class 'omit'  Named int 91
>  .. .. ..- attr(*, "names")= chr "91"
>  - attr(*, "class")= chr "lm"
>
> Now we know that we can do :
>> not <-attr(test$model,"na.action")
>> y[-not]
>
> If I have, for example, y <- rnorm(100), I also get the error:
> "error in -not: invalid argument to unary operator"
>
>
> In my database:
>
>      female    income transport lunch dist reg_f
> 4900      0 18.405990         0     0   75   750
> 4901      0        NA         0    NA   75   753
> 4902      1        NA         0     1   75   752
> 4903      1        NA         0     1   75   751
> 4904      1 69.678340         1     0   74   740
> 4905      0 57.953230         1     0   73   730
> 4906      1 85.835130         0     1   68   680
> 4907      0 81.952980         0     0   75   750
> 4908      1 46.837490         1     0   74   740
> 4909      0        NA         1     0    5    52
> 4910      1 65.041360         0     1   75   750
> 4911      0 77.451870         1     0   75   750
> 4912      0 96.148590         1     0   75   750
> 4913      0 64.510410         0     0   74   740
> 4914      0 69.391230         0     0   75   750
> 4915      0  4.804243         0     1   65   650
> 4916      0        NA         0     0   75   751
> 4917      1        NA         0     0   75   751
> 4918      1        NA         0     1   40   401
> 4919      1 49.920750         0     1   76   760
> 4920      0        NA         0     1   76   763
> 4921      0 10.187910         0     0   77   770
> 4922      0 14.839710         0     1   77   770
> 4923      1 32.041000         0     0   77   770
> 4924      0 85.639440         0     0   77   770
> 4925      1 86.308410         0     0   68   680
> 4926      0 79.223910         0     0    7    70
> 4927      0 81.825800         0     0   78   780
> 4928      0 31.931000         0     1   37   370
> 4929      0 53.282310         0     1   41   410
> 4930      1 31.312910         1     1   25   250
> 4931      1 50.478870         0     1   25   250
> 4932      0        NA         0     0   66   662
> 4933      1 58.156940         0     1   31   310
> 4934      0        NA         0     1    1    13
> 4935      1        NA         0     1    1    12
> 4936      1 59.149180         0     0    3    30
> 4937      1  5.400807         0     1    5    50
> 4938      1 76.828630         0     0    6    60
> 4939      1 73.488300         0     1   63   630
> 4940      0  6.529074         0     1    6    60
> 4941      0        NA         0     0    6    61
> 4942      1 70.128530         0     0    3    30
> 4943      0        NA         0     1   53   531
> 4944      1 75.715350         1     0    6    60
> 4945      0  8.623850         0     1   24   240
> 4946      1 79.062470         0     1   62   620
> 4947      0 83.863370         1     1   11   110
> 4948      0 58.904450         0     1   62   620
> 4949      0 88.500290         0     0    9    90
> 4950      0        NA         1     1    9    90
> 4951     NA        NA         0     1   15   151
>
> It works perfectly if I do
> cl2(testdata, lm(income ~ transport + reg_f ), female)
>
> but not for cl2(testdata, lm(dist ~ transport + reg_f ), female)
> Or any other case about when the arguments of the "lm" function have no
> "missing" values.
>
> How can I tell R to do this only if there is a "missing" problem?
>
> Thanks a lot for your help! Working on your previous replies has been very
> helpful in understanding how R works.
>
> Cheers,
> Edmund.
>
>
> 2010/6/13 Joris Meys <jorismeys at gmail.com>
>>
>>  Next to that, Off course you have to use the right  indices for
>> "cluster", but as I have no clue what the "dist" is, I just put there
>> something. So if it is a matrix, naturally you'll get errors. If I
>> make a vector "cluster" with the same length as the dataframe, giving
>> the clustering, then it works perfectly fine.
>>
>>
>>
>>
>> On Sat, Jun 12, 2010 at 4:20 PM, edmund jones <edmund.j.jones at gmail.com>
>> wrote:
>> > Hi,
>> > Thank you for your reply;
>> > I tried what you suggested, but it still didn't work.
>> > I see that my question was not precise enough; I have tried to specify
>> > it
>> > this time and I have also added data.
>>
>> I still can't run your code, because I don't have the cluster argument
>> "dist".
>>
>> > In other words: how can I tell R to do this line
>> >
>> > y <- testdata[which(apply(testdata[,c()],1, function(x)
>> > is.element(NA,x))==F
>> > ),]
>> I showed you how to "do that line". The line "not <-
>> attr(fm$model,"na.action")" gives you the row numbers of deleted
>> observations in the fit. Now you just have to use those as an index to
>> clear up any environment.  Try with the "dat" maybe :
>>
>> cl <- function(dat, fm, cluster){
>> require(sandwich)
>> require(lmtest)
>>  not <- attr(fm$model,"na.action")
>>
>> with( dat[-not,] ,{
>>
>>   M <- length(unique(cluster))
>>   N <- length(cluster)
>>   K <- fm$rank
>>
>>   dfc <- (M/(M-1))*((N-1)/(N-K))
>>
>>   uj <- data.frame(apply(estfun(fm),2, function(x) data.frame(tapply(x,
>>   cluster, sum)) ) );
>>
>>   vcovCL <- dfc*sandwich(fm, meat=crossprod(uj)/N)
>>
>>   coeftest(fm, vcovCL)
>>   } )
>> }
>>
>> Cheers
>> Joris
>>
>>
>> >
>> > for the columns that will be the arguments of my "cl" function, so I
>> > don't
>> > have to manually ignore the observations for which I have some missing
>> > variables of interest for this function?
>> >
>> > Thank you very much.
>> >
>> > Cheers,
>> > Edmund
>> >
>> > 2010/6/9 Joris Meys <jorismeys at gmail.com>
>> >>
>> >> It's difficult to help you if we don't know what the data looks like.
>> >> Two more tips :
>> >> - look at ?with instead of attach(). The latter causes a lot of
>> >> trouble further on.
>> >> - use require() instead of library(). See also the help files on this.
>> >>
>> >> I also wonder what you're doing with the na.rm=TRUE. I don't see how
>> >> it affects the function, as it is used nowhere. Does it really remove
>> >> NA?
>> >>
>> >> This said: estfun takes the estimation function from your model
>> >> object. Your model is essentially fit on all observations that are not
>> >> NA, whereas I presume your cluster contains all observations,
>> >> including NA. Now I don't know what kind of model fm represents, but
>> >> normally there is information in the model object about the removed
>> >> observations. I illustrate this using lm :
>> >>
>> >> > x <- rnorm(100)
>> >> > y <- c(rnorm(90),NA,rnorm(9))
>> >> > test <- lm(x~y)
>> >> > str(test)
>> >> List of 13
>> >>  ... (tons of information)
>> >>  $ model        :'data.frame':  99 obs. of  2 variables:
>> >>  ... (more tons of information)
>> >>  ..- attr(*, "na.action")=Class 'omit'  Named int 91
>> >>  .. .. ..- attr(*, "names")= chr "91"
>> >>  - attr(*, "class")= chr "lm"
>> >>
>> >> Now we know that we can do :
>> >> > not <-attr(test$model,"na.action")
>> >> > y[-not]
>> >>
>> >> So try this
>> >> ### NOT TESTED
>> >> cl <- function(dat, na.rm = TRUE, fm, cluster){
>> >> require(sandwich)
>> >> require(lmtest)
>> >>
>> >> not <- attr(fm$model,"na.action")
>> >> cluster <- cluster[-not]
>> >>
>> >> with( dat ,{
>> >>    M <- length(unique(cluster))
>> >>    N <- length(cluster)
>> >>    K <- fm$rank
>> >>
>> >>    dfc <- (M/(M-1))*((N-1)/(N-K))
>> >>
>> >>    uj <- data.frame(apply(estfun(fm),2, function(x)
>> >> data.frame(tapply(x,
>> >>    cluster, sum)) ) );
>> >>
>> >>    vcovCL <- dfc*sandwich(fm, meat=crossprod(uj)/N)
>> >>
>> >>    coeftest(fm, vcovCL)
>> >>    } )
>> >> }
>> >>
>> >> Cheers
>> >> Joris
>> >>
>> >> On Wed, Jun 9, 2010 at 12:06 AM, edmund jones
>> >> <edmund.j.jones at gmail.com>
>> >> wrote:
>> >> > Hi,
>> >> >
>> >> > I am relatively new to R; when creating functions, I run into
>> >> > problems
>> >> > with
>> >> > missing values. I would like my functions to ignore rows with missing
>> >> > values
>> >> > for arguments of my function) in the analysis (as for example is the
>> >> > case in
>> >> > STATA). Note that I don't want my function to drop rows if there are
>> >> > missing
>> >> > arguments elsewhere in a row, ie for variables that are not arguments
>> >> > of
>> >> > my
>> >> > function.
>> >> >
>> >> > As an example: here is a clustering function I wrote:
>> >> >
>> >> >
>> >> > cl <- function(dat, na.rm = TRUE, fm, cluster){
>> >> >
>> >> > attach( dat , warn.conflicts = F)
>> >> >
>> >> > library(sandwich)
>> >> >
>> >> > library(lmtest)
>> >> >
>> >> > M <- length(unique(cluster))
>> >> >
>> >> > N <- length(cluster)
>> >> >
>> >> > K <- fm$rank
>> >> >
>> >> > dfc <- (M/(M-1))*((N-1)/(N-K))
>> >> >
>> >> > uj <- data.frame(apply(estfun(fm),2, function(x) data.frame(tapply(x,
>> >> > cluster, sum)) ) );
>> >> >
>> >> > vcovCL <- dfc*sandwich(fm, meat=crossprod(uj)/N)
>> >> >
>> >> > coeftest(fm, vcovCL)
>> >> >
>> >> > }
>> >> >
>> >> >
>> >> > When I run my function, I get the message:
>> >> >
>> >> >
>> >> > Error in tapply(x, cluster, sum) : arguments must have same length
>> >> >
>> >> >
>> >> > If I specify instead attach(na.omit(dat), warn.conflicts = F)  and
>> >> > don't
>> >> > have the "na.rm=TRUE" argument, then my function runs; but only for
>> >> > the
>> >> > rows
>> >> > where there are no missing values AT ALL; however, I don't care if
>> >> > there
>> >> > are
>> >> > missing values for variables on which I am not applying my function.
>> >> >
>> >> >
>> >> > For example, I have information on children's size; if I want regress
>> >> > scores
>> >> > on age and parents' education, clustering on class, I would like
>> >> > missing
>> >> > values in size not to interfere (ie if I have scores, age, parents'
>> >> > education, and class, but not size, I don't want to drop this
>> >> > observation).
>> >> >
>> >> >
>> >> > I tried to look at the code of "lm" to see how the na.action part
>> >> > works,
>> >> > but
>> >> > I couldn't figure it out... This is exactly how I would like to deal
>> >> > with
>> >> > missing values.
>> >> >
>> >> >
>> >> > I tried to write
>> >> >
>> >> > cl <- function(dat, fm, cluster, na.action){
>> >> >
>> >> > attach( dat , warn.conflicts = F)
>> >> >
>> >> > library(sandwich)
>> >> >
>> >> > library(lmtest)
>> >> >
>> >> >  M <- length(unique(cluster))
>> >> >
>> >> >  N <- length(cluster)
>> >> >
>> >> >  K <- fm$rank
>> >> >
>> >> >  dfc <- (M/(M-1))*((N-1)/(N-K))
>> >> >
>> >> > uj <- data.frame(apply(estfun(fm),2, function(x) data.frame(tapply(x,
>> >> > cluster, sum)) ) );
>> >> >
>> >> > vcovCL <- dfc*sandwich(fm, meat=crossprod(uj)/N)
>> >> >
>> >> >  coeftest(fm, vcovCL)
>> >> >
>> >> > }
>> >> >
>> >> >  attr(cl,"na.action") <- na.exclude
>> >> >
>> >> >
>> >> > but it still didn't work...
>> >> >
>> >> >
>> >> > Any ideas of how to deal with this issue?
>> >> >
>> >> > Thank you for your answers!
>> >> >
>> >> > Edmund
>> >> >
>> >> >        [[alternative HTML version deleted]]
>> >> >
>> >> > ______________________________________________
>> >> > R-help at r-project.org mailing list
>> >> > https://stat.ethz.ch/mailman/listinfo/r-help
>> >> > PLEASE do read the posting guide
>> >> > http://www.R-project.org/posting-guide.html
>> >> > and provide commented, minimal, self-contained, reproducible code.
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Joris Meys
>> >> Statistical consultant
>> >>
>> >> Ghent University
>> >> Faculty of Bioscience Engineering
>> >> Department of Applied mathematics, biometrics and process control
>> >>
>> >> tel : +32 9 264 59 87
>> >> Joris.Meys at Ugent.be
>> >> -------------------------------
>> >> Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php
>> >
>> >
>>
>>
>>
>> --
>> Joris Meys
>> Statistical consultant
>>
>> Ghent University
>> Faculty of Bioscience Engineering
>> Department of Applied mathematics, biometrics and process control
>>
>> tel : +32 9 264 59 87
>> Joris.Meys at Ugent.be
>> -------------------------------
>> Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php
>
>

-- 
Joris Meys
Statistical consultant

Ghent University
Faculty of Bioscience Engineering
Department of Applied mathematics, biometrics and process control

tel : +32 9 264 59 87
Joris.Meys at Ugent.be
-------------------------------
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php