[R] how to ignore rows missing arguments of a function when creating a function?
Joris Meys
jorismeys at gmail.com
Mon Jun 14 16:36:15 CEST 2010
Ah, I overlooked that possibility.
You can do following :
....
not <- attr(fm$model,"na.action")
if( ! is.null(not)){ # only drop the NA values if there are any left
out of the model
cluster <- cluster[-not]
dat <- dat[-not,]
}
with(dat,{
....
On Mon, Jun 14, 2010 at 4:30 PM, edmund jones <edmund.j.jones at gmail.com> wrote:
> Thanks a lot!
>
> So, what you propose works perfectly (the "cluster" variable is indeed a
> vector), except in the case where I have no missing values in my regression.
> With the following function:
>
> cl2 <- function(dat,fm, cluster){
> attach(dat, warn.conflicts = F)
> require(sandwich)
> require(lmtest)
> not <- attr(fm$model,"na.action")
> cluster <- cluster[-not]
> with( dat[-not,] ,{
> M <- length(unique(cluster))
> N <- length(cluster)
> K <- fm$rank
> dfc <- (M/(M-1))*((N-1)/(N-K))
> uj <- apply(estfun(fm),2, function(x) tapply(x, cluster, sum));
> vcovCL <- dfc*sandwich(fm, meat=crossprod(uj)/N)
> coeftest(fm, vcovCL)
> }
> )
> }
>
> If I have no missing values in the arguments of fm, get the message
> "error in -not: invalid argument to unary operator"
>
> In your example:
>
>
>> x <- rnorm(100)
>> y <- c(rnorm(90),NA,rnorm(9))
>> test <- lm(x~y)
>> str(test)
> List of 13
> ... (tons of information)
> $ model :'data.frame': 99 obs. of 2 variables:
> ... (more tons of information)
> ..- attr(*, "na.action")=Class 'omit' Named int 91
> .. .. ..- attr(*, "names")= chr "91"
> - attr(*, "class")= chr "lm"
>
> Now we know that we can do :
>> not <-attr(test$model,"na.action")
>> y[-not]
>
> If I have, for example, y <- rnorm(100), I also get the error:
> "error in -not: invalid argument to unary operator"
>
>
> In my database:
>
> female income transport lunch dist reg_f
> 4900 0 18.405990 0 0 75 750
> 4901 0 NA 0 NA 75 753
> 4902 1 NA 0 1 75 752
> 4903 1 NA 0 1 75 751
> 4904 1 69.678340 1 0 74 740
> 4905 0 57.953230 1 0 73 730
> 4906 1 85.835130 0 1 68 680
> 4907 0 81.952980 0 0 75 750
> 4908 1 46.837490 1 0 74 740
> 4909 0 NA 1 0 5 52
> 4910 1 65.041360 0 1 75 750
> 4911 0 77.451870 1 0 75 750
> 4912 0 96.148590 1 0 75 750
> 4913 0 64.510410 0 0 74 740
> 4914 0 69.391230 0 0 75 750
> 4915 0 4.804243 0 1 65 650
> 4916 0 NA 0 0 75 751
> 4917 1 NA 0 0 75 751
> 4918 1 NA 0 1 40 401
> 4919 1 49.920750 0 1 76 760
> 4920 0 NA 0 1 76 763
> 4921 0 10.187910 0 0 77 770
> 4922 0 14.839710 0 1 77 770
> 4923 1 32.041000 0 0 77 770
> 4924 0 85.639440 0 0 77 770
> 4925 1 86.308410 0 0 68 680
> 4926 0 79.223910 0 0 7 70
> 4927 0 81.825800 0 0 78 780
> 4928 0 31.931000 0 1 37 370
> 4929 0 53.282310 0 1 41 410
> 4930 1 31.312910 1 1 25 250
> 4931 1 50.478870 0 1 25 250
> 4932 0 NA 0 0 66 662
> 4933 1 58.156940 0 1 31 310
> 4934 0 NA 0 1 1 13
> 4935 1 NA 0 1 1 12
> 4936 1 59.149180 0 0 3 30
> 4937 1 5.400807 0 1 5 50
> 4938 1 76.828630 0 0 6 60
> 4939 1 73.488300 0 1 63 630
> 4940 0 6.529074 0 1 6 60
> 4941 0 NA 0 0 6 61
> 4942 1 70.128530 0 0 3 30
> 4943 0 NA 0 1 53 531
> 4944 1 75.715350 1 0 6 60
> 4945 0 8.623850 0 1 24 240
> 4946 1 79.062470 0 1 62 620
> 4947 0 83.863370 1 1 11 110
> 4948 0 58.904450 0 1 62 620
> 4949 0 88.500290 0 0 9 90
> 4950 0 NA 1 1 9 90
> 4951 NA NA 0 1 15 151
>
> It works perfectly if I do
> cl2(testdata, lm(income ~ transport + reg_f ), female)
>
> but not for cl2(testdata, lm(dist ~ transport + reg_f ), female)
> Or any other case about when the arguments of the "lm" function have no
> "missing" values.
>
> How can I tell R to do this only if there is a "missing" problem?
>
> Thanks a lot for your help! Working on your previous replies has been very
> helpful in understanding how R works.
>
> Cheers,
> Edmund.
>
>
> 2010/6/13 Joris Meys <jorismeys at gmail.com>
>>
>> Next to that, Off course you have to use the right indices for
>> "cluster", but as I have no clue what the "dist" is, I just put there
>> something. So if it is a matrix, naturally you'll get errors. If I
>> make a vector "cluster" with the same length as the dataframe, giving
>> the clustering, then it works perfectly fine.
>>
>>
>>
>>
>> On Sat, Jun 12, 2010 at 4:20 PM, edmund jones <edmund.j.jones at gmail.com>
>> wrote:
>> > Hi,
>> > Thank you for your reply;
>> > I tried what you suggested, but it still didn't work.
>> > I see that my question was not precise enough; I have tried to specify
>> > it
>> > this time and I have also added data.
>>
>> I still can't run your code, because I don't have the cluster argument
>> "dist".
>>
>> > In other words: how can I tell R to do this line
>> >
>> > y <- testdata[which(apply(testdata[,c()],1, function(x)
>> > is.element(NA,x))==F
>> > ),]
>> I showed you how to "do that line". The line "not <-
>> attr(fm$model,"na.action")" gives you the row numbers of deleted
>> observations in the fit. Now you just have to use those as an index to
>> clear up any environment. Try with the "dat" maybe :
>>
>> cl <- function(dat, fm, cluster){
>> require(sandwich)
>> require(lmtest)
>> not <- attr(fm$model,"na.action")
>>
>> with( dat[-not,] ,{
>>
>> M <- length(unique(cluster))
>> N <- length(cluster)
>> K <- fm$rank
>>
>> dfc <- (M/(M-1))*((N-1)/(N-K))
>>
>> uj <- data.frame(apply(estfun(fm),2, function(x) data.frame(tapply(x,
>> cluster, sum)) ) );
>>
>> vcovCL <- dfc*sandwich(fm, meat=crossprod(uj)/N)
>>
>> coeftest(fm, vcovCL)
>> } )
>> }
>>
>> Cheers
>> Joris
>>
>>
>> >
>> > for the columns that will be the arguments of my "cl" function, so I
>> > don't
>> > have to manually ignore the observations for which I have some missing
>> > variables of interest for this function?
>> >
>> > Thank you very much.
>> >
>> > Cheers,
>> > Edmund
>> >
>> > 2010/6/9 Joris Meys <jorismeys at gmail.com>
>> >>
>> >> It's difficult to help you if we don't know what the data looks like.
>> >> Two more tips :
>> >> - look at ?with instead of attach(). The latter causes a lot of
>> >> trouble further on.
>> >> - use require() instead of library(). See also the help files on this.
>> >>
>> >> I also wonder what you're doing with the na.rm=TRUE. I don't see how
>> >> it affects the function, as it is used nowhere. Does it really remove
>> >> NA?
>> >>
>> >> This said: estfun takes the estimation function from your model
>> >> object. Your model is essentially fit on all observations that are not
>> >> NA, whereas I presume your cluster contains all observations,
>> >> including NA. Now I don't know what kind of model fm represents, but
>> >> normally there is information in the model object about the removed
>> >> observations. I illustrate this using lm :
>> >>
>> >> > x <- rnorm(100)
>> >> > y <- c(rnorm(90),NA,rnorm(9))
>> >> > test <- lm(x~y)
>> >> > str(test)
>> >> List of 13
>> >> ... (tons of information)
>> >> $ model :'data.frame': 99 obs. of 2 variables:
>> >> ... (more tons of information)
>> >> ..- attr(*, "na.action")=Class 'omit' Named int 91
>> >> .. .. ..- attr(*, "names")= chr "91"
>> >> - attr(*, "class")= chr "lm"
>> >>
>> >> Now we know that we can do :
>> >> > not <-attr(test$model,"na.action")
>> >> > y[-not]
>> >>
>> >> So try this
>> >> ### NOT TESTED
>> >> cl <- function(dat, na.rm = TRUE, fm, cluster){
>> >> require(sandwich)
>> >> require(lmtest)
>> >>
>> >> not <- attr(fm$model,"na.action")
>> >> cluster <- cluster[-not]
>> >>
>> >> with( dat ,{
>> >> M <- length(unique(cluster))
>> >> N <- length(cluster)
>> >> K <- fm$rank
>> >>
>> >> dfc <- (M/(M-1))*((N-1)/(N-K))
>> >>
>> >> uj <- data.frame(apply(estfun(fm),2, function(x)
>> >> data.frame(tapply(x,
>> >> cluster, sum)) ) );
>> >>
>> >> vcovCL <- dfc*sandwich(fm, meat=crossprod(uj)/N)
>> >>
>> >> coeftest(fm, vcovCL)
>> >> } )
>> >> }
>> >>
>> >> Cheers
>> >> Joris
>> >>
>> >> On Wed, Jun 9, 2010 at 12:06 AM, edmund jones
>> >> <edmund.j.jones at gmail.com>
>> >> wrote:
>> >> > Hi,
>> >> >
>> >> > I am relatively new to R; when creating functions, I run into
>> >> > problems
>> >> > with
>> >> > missing values. I would like my functions to ignore rows with missing
>> >> > values
>> >> > for arguments of my function) in the analysis (as for example is the
>> >> > case in
>> >> > STATA). Note that I don't want my function to drop rows if there are
>> >> > missing
>> >> > arguments elsewhere in a row, ie for variables that are not arguments
>> >> > of
>> >> > my
>> >> > function.
>> >> >
>> >> > As an example: here is a clustering function I wrote:
>> >> >
>> >> >
>> >> > cl <- function(dat, na.rm = TRUE, fm, cluster){
>> >> >
>> >> > attach( dat , warn.conflicts = F)
>> >> >
>> >> > library(sandwich)
>> >> >
>> >> > library(lmtest)
>> >> >
>> >> > M <- length(unique(cluster))
>> >> >
>> >> > N <- length(cluster)
>> >> >
>> >> > K <- fm$rank
>> >> >
>> >> > dfc <- (M/(M-1))*((N-1)/(N-K))
>> >> >
>> >> > uj <- data.frame(apply(estfun(fm),2, function(x) data.frame(tapply(x,
>> >> > cluster, sum)) ) );
>> >> >
>> >> > vcovCL <- dfc*sandwich(fm, meat=crossprod(uj)/N)
>> >> >
>> >> > coeftest(fm, vcovCL)
>> >> >
>> >> > }
>> >> >
>> >> >
>> >> > When I run my function, I get the message:
>> >> >
>> >> >
>> >> > Error in tapply(x, cluster, sum) : arguments must have same length
>> >> >
>> >> >
>> >> > If I specify instead attach(na.omit(dat), warn.conflicts = F) and
>> >> > don't
>> >> > have the "na.rm=TRUE" argument, then my function runs; but only for
>> >> > the
>> >> > rows
>> >> > where there are no missing values AT ALL; however, I don't care if
>> >> > there
>> >> > are
>> >> > missing values for variables on which I am not applying my function.
>> >> >
>> >> >
>> >> > For example, I have information on children's size; if I want regress
>> >> > scores
>> >> > on age and parents' education, clustering on class, I would like
>> >> > missing
>> >> > values in size not to interfere (ie if I have scores, age, parents'
>> >> > education, and class, but not size, I don't want to drop this
>> >> > observation).
>> >> >
>> >> >
>> >> > I tried to look at the code of "lm" to see how the na.action part
>> >> > works,
>> >> > but
>> >> > I couldn't figure it out... This is exactly how I would like to deal
>> >> > with
>> >> > missing values.
>> >> >
>> >> >
>> >> > I tried to write
>> >> >
>> >> > cl <- function(dat, fm, cluster, na.action){
>> >> >
>> >> > attach( dat , warn.conflicts = F)
>> >> >
>> >> > library(sandwich)
>> >> >
>> >> > library(lmtest)
>> >> >
>> >> > M <- length(unique(cluster))
>> >> >
>> >> > N <- length(cluster)
>> >> >
>> >> > K <- fm$rank
>> >> >
>> >> > dfc <- (M/(M-1))*((N-1)/(N-K))
>> >> >
>> >> > uj <- data.frame(apply(estfun(fm),2, function(x) data.frame(tapply(x,
>> >> > cluster, sum)) ) );
>> >> >
>> >> > vcovCL <- dfc*sandwich(fm, meat=crossprod(uj)/N)
>> >> >
>> >> > coeftest(fm, vcovCL)
>> >> >
>> >> > }
>> >> >
>> >> > attr(cl,"na.action") <- na.exclude
>> >> >
>> >> >
>> >> > but it still didn't work...
>> >> >
>> >> >
>> >> > Any ideas of how to deal with this issue?
>> >> >
>> >> > Thank you for your answers!
>> >> >
>> >> > Edmund
>> >> >
>> >> > [[alternative HTML version deleted]]
>> >> >
>> >> > ______________________________________________
>> >> > R-help at r-project.org mailing list
>> >> > https://stat.ethz.ch/mailman/listinfo/r-help
>> >> > PLEASE do read the posting guide
>> >> > http://www.R-project.org/posting-guide.html
>> >> > and provide commented, minimal, self-contained, reproducible code.
>> >> >
>> >>
>> >>
>> >>
>> >> --
>> >> Joris Meys
>> >> Statistical consultant
>> >>
>> >> Ghent University
>> >> Faculty of Bioscience Engineering
>> >> Department of Applied mathematics, biometrics and process control
>> >>
>> >> tel : +32 9 264 59 87
>> >> Joris.Meys at Ugent.be
>> >> -------------------------------
>> >> Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php
>> >
>> >
>>
>>
>>
>> --
>> Joris Meys
>> Statistical consultant
>>
>> Ghent University
>> Faculty of Bioscience Engineering
>> Department of Applied mathematics, biometrics and process control
>>
>> tel : +32 9 264 59 87
>> Joris.Meys at Ugent.be
>> -------------------------------
>> Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php
>
>
--
Joris Meys
Statistical consultant
Ghent University
Faculty of Bioscience Engineering
Department of Applied mathematics, biometrics and process control
tel : +32 9 264 59 87
Joris.Meys at Ugent.be
-------------------------------
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php
More information about the R-help
mailing list