[R] Problem with "apply"

Tobias Verbeke tobias.verbeke at telenet.be
Wed Apr 22 21:23:28 CEST 2009


Marc Schwartz wrote:

> The cut() function will do what you want in a vectorized fashion. See ?cut
> 
> However, that being said, I would strongly advise that you read Frank's 
> page on the categorizing of continuous variables:
> 
>   http://biostat.mc.vanderbilt.edu/twiki/bin/view/Main/CatContinuous
> 
> before you proceed.

A simple example of how to use cut() for your problem:

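set.seed(158)
ages <- sample(0:100, 50, TRUE)   # 50 simulated ages between 0 and 100
head(ages)
# right = FALSE closes each interval on the left: [5,15) holds ages 5 to 14
ageGroups <- cut(ages, breaks = c(-1,5,15,30,70,80,150), right = FALSE,
     labels = c("0-4", "5-14", "15-29", "30-69", "70-79", "80+"))
head(ageGroups)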
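
Because cut() is vectorized, the same call can be run on a whole
data frame column at once, with no apply() or for-loop. The following is
only a sketch: `age_class`, `mds_demo` and its column names are
hypothetical stand-ins (your age column is number 11 of mds), and I have
used 0 and Inf as the outer breaks here, which is my variation rather
than anything your data requires.

```r
# Hypothetical helper: classify a numeric vector of ages in one call.
age_class <- function(age) {
  cut(age,
      breaks = c(0, 5, 15, 30, 70, 80, Inf),
      right  = FALSE,                      # intervals are [lower, upper)
      labels = c("0-4", "5-14", "15-29", "30-69", "70-79", "80+"))
}

# Hypothetical stand-in for the real data frame (names are assumptions).
set.seed(158)
mds_demo <- data.frame(id  = sprintf("r%03d", 1:50),
                       age = sample(0:100, 50, TRUE),
                       stringsAsFactors = FALSE)

# Classify the whole column directly; the column stays numeric throughout.
mds_demo$ageGroup <- age_class(mds_demo$age)

# Quick check of the class counts, including any unclassified values.
table(mds_demo$ageGroup, useNA = "ifany")
```

With the real data this would be something like
age_class(mds[, 11]), assuming that column is numeric as your
mode() output suggests.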

See ?cut
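
As for why the apply() version returned NA: I cannot see your data, so
this is a guess, but the symptoms fit the way apply() coerces a data
frame. It calls as.matrix() first, and if any column is non-numeric the
result is a character matrix in which numeric columns are passed through
format(), which pads them to a common width. A one-digit age then
becomes " 3" and no longer matches 0:4. The small data frame `demo`
below is a made-up illustration, not your data.

```r
# Hypothetical miniature: one character column plus a numeric age column.
demo <- data.frame(id  = c("a", "b", "c"),
                   age = c(3, 82, 40),
                   stringsAsFactors = FALSE)

# This is the coercion apply(demo, 1, ...) performs before calling FUN.
m <- as.matrix(demo)
m[, "age"]                 # character values, padded to a common width

# The padded string does not match any of "0", "1", ..., "4".
m[1, "age"] %in% 0:4       # FALSE

# The column itself is still numeric, so the direct test works.
demo$age[1] %in% 0:4       # TRUE
```

If that is what happens, it would also explain why subsets containing
three-digit ages fail for every age below 100: the padding is then to
width 3. Working on the column directly, as cut() does, avoids the
coercion altogether.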

HTH,
Tobias

> On Apr 22, 2009, at 1:56 PM, Alan Cohen wrote:
> 
>> Hi R users,
>>
>> I am trying to assign ages to age classes for a large data set 
>> (123,000 records), and using a for-loop was too slow, so I wrote a 
>> function and used apply.  However, the function does not properly 
>> assign the first two classes (the rest are fine).  It appears that 
>> when age is one digit, it does not get assigned properly.
>>
>> I tried to provide a small-scale work-up (at the end of the email) but 
>> it does not reproduce the problem; the best I can do is to provide my 
>> code and the output below.  As you can see, I've confirmed that age is 
>> numeric, that all values are integers, and that pieces of the code 
>> work independently.  Any thoughts would be appreciated.
>>
>> To add to the mystery, depending which rows of my data set I select, I 
>> get different problems.  mds[1:100,] gives the problem above, as do 
>> mds[100:200,] , mds[150:250,] and mds[10000:10100,].  However, with 
>> mds[200:300,], mds[250:350,] and mds[1000:1100,], only ages with 3 
>> digits are correctly assigned - all ages <100 are returned as NA.
>>
>> I'm using R v 2.8.1 on Windows XP.
>>
>> Cheers,
>> Alan Cohen
>> Centre for Global Health Research,
>> Toronto,ON
>>
>>> ageassign <- function(x){
>> +   y <- NA
>> +   if (x[11] %in% c(0:4)) {y <- "0-4"}
>> +   else if (x[11] %in% c(5:14)) {y <- "5-14" }
>> +   else if (x[11] %in% c(15:29)) {y <- "15-29" }
>> +   else if (x[11] %in% c(30:69)) {y <- "30-69"}
>> +   else if (x[11] %in% c(70:79)) {y <- "70-79"}
>> +   else if (x[11] %in% c(80:125)) {y <- "80+"}
>> +   return(y)
>> + }
>>> jj <- apply(mds[1:100,],1,FUN=ageassign)
>>> jj
>>       1       2       3       4       5       6       7       8       9      10      11      12      13
>>      NA   "80+" "30-69" "30-69"   "80+"      NA "30-69" "30-69" "70-79" "15-29" "15-29" "30-69" "70-79"
>>      14      15      16      17      18      19      20      21      22      23      24      25      26
>>   "80+"      NA "30-69" "30-69" "30-69"   "80+"   "80+" "15-29" "70-79" "30-69" "70-79" "70-79" "30-69"
>>      27      28      29      30      31      32      33      34      35      36      37      38      39
>> "70-79"   "80+"      NA   "80+" "70-79"      NA "15-29" "15-29"      NA      NA "70-79" "30-69" "30-69"
>>      40      41      42      43      44      45      46      47      48      49      50      51      52
>> "70-79" "30-69" "30-69" "30-69" "70-79" "30-69" "30-69" "70-79" "15-29" "30-69"      NA "15-29" "30-69"
>>      53      54      55      56      57      58      59      60      61      62      63      64      65
>> "30-69"      NA "70-79" "30-69" "30-69" "30-69" "30-69" "15-29" "30-69" "30-69" "70-79" "30-69"      NA
>>      66      67      68      69      70      71      72      73      74      75      76      77      78
>> "30-69" "30-69" "30-69" "30-69" "30-69"   "80+" "30-69"   "80+" "70-79" "30-69" "30-69" "30-69"      NA
>>      79      80      81      82      83      84      85      86      87      88      89      90      91
>> "30-69" "30-69" "30-69"      NA   "80+" "30-69" "30-69" "30-69"      NA "15-29" "30-69" "30-69" "30-69"
>>      92      93      94      95      96      97      98      99     100
>> "30-69" "30-69" "30-69" "30-69" "70-79" "30-69" "30-69" "30-69" "30-69"
>>> mds[1:100,11]
>>   [1]  3 82 40 35 82  1 37 57 71 22 21 52 73 86  1 43 60 63 84 88 29 73 69 75 73 43 75 83  4 83 77  1 27
>>  [34] 15  1  6 76 51 45 71 54 64 69 70 48 38 74 26 37  4 18 63 59  8 78 63 67 62 50 21 66 69 75 57  4 50
>>  [67] 58 60 61 62 83 69 92 75 30 49 69  1 69 63 69  0 93 64 59 69  2 25 32 60 66 67 54 53 64 79 59 49 59
>> [100] 64
>>> table(mds[,11])
>>
>>    0    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16   17   18   19
>> 3123 6441 3856 2884 1968 1615 1386 1088 1098  721  943  681  511  380  426  835  571  555  719  653
>>   20   21   22   23   24   25   26   27   28   29   30   31   32   33   34   35   36   37   38   39
>>  879  715  672  631  655  773  680  713  769  538  685  566  729  702  652  766  683  723  821  675
>>   40   41   42   43   44   45   46   47   48   49   50   51   52   53   54   55   56   57   58   59
>>  774  650  908  892  784  925  781 1043 1161  924 1087  827 1261 1356 1297 1272 1277 1614 1831 1523
>>   60   61   62   63   64   65   66   67   68   69   70   71   72   73   74   75   76   77   78   79
>> 1702 1251 1954 2157 1901 2090 1874 2705 3085 2529 2488 1777 2701 2586 2308 2020 1801 2269 2486 1856
>>   80   81   82   83   84   85   86   87   88   89   90   91   92   93   94   95   96   97   98   99
>> 1762 1047 1413 1326  967 1013  753  870  884  531  601  277  364  301  193  288  149  174  169  470
>>  100  101  102  103  104  105  106  107  108  114  115  117  118  120  125
>>   15    2    5    7    2    4    1    1    2    1    1    2    2    2    1
>>> mode(mds[,11])
>> [1] "numeric"
>>
>>> mds[1,11] %in% c(0:4)
>> [1] TRUE
>>> if (mds[1,11] %in% c(0:4)) {y <- "0-4"}
>>> y
>> [1] "0-4"
>>
>>> xx <- matrix(trunc(runif(30,0,125)),15,2)
>>> aassign <- function(x){
>> +   y <- NA
>> +   if (x[2] %in% c(0:4)) {y <- "0-4"}
>> +   else if (x[2] %in% c(5:14)) {y <- "5-14" }
>> +   else if (x[2] %in% c(15:29)) {y <- "15-29" }
>> +   else if (x[2] %in% c(30:69)) {y <- "30-69"}
>> +   else if (x[2] %in% c(70:79)) {y <- "70-79"}
>> +   else if (x[2] %in% c(80:125)) {y <- "80+"}
>> +   return(y)
>> + }
>>> jj <- apply(xx,1,FUN=aassign)
>>> t(xx)
>>      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15]
>> [1,]   23   98  107   94   76  103  106   40   66    11   109   101    96    37    18
>> [2,]   11   57   58   91   43  123  103   77    4    79    64    10     8   105    76
>>> jj
>>  [1] "5-14"  "30-69" "30-69" "80+"   "30-69" "80+"   "80+"   "70-79" "0-4"   "70-79" "30-69" "5-14"
>> [13] "5-14"  "80+"   "70-79"
>>>
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide 
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.



