[R] Cleaning database: grep()? apply()?

jim holtman jholtman at gmail.com
Tue Nov 13 23:25:14 CET 2007


Here is how to whittle it down for the first two parts of your
question.  I am not exactly sure what you are after in the third part.  Is
it that you want specific DATEs, or do you want the ratio of the
values at DATE[max]/DATE[min]?

> x <- read.table(textConnection("CODE    NAME                                                   DATE         DATA1
+ 4813    'ADVANCED TELECOM'                        1987    0.013
+ 3845    'ADVANCED THERAPEUTIC SYS LTD'    1987    10.1
+ 3845    'ADVANCED THERAPEUTIC SYS LTD'    1989    2.463
+ 3845    'ADVANCED THERAPEUTIC SYS LTD'    1988    1.563
+ 2836    'ADVANCED TISSUE SCI  -CL A'                      1987    0.847
+ 2836    'ADVANCED TISSUE SCI  -CL A'                       1989   0.872
+ 2836    'ADVANCED TISSUE SCI  -CL A'                       1988   0.529"), header=TRUE)
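> # flag rows whose NAME ends in one of the suffixes to delete
> delete_rows <- grepl("-CL A$|-OLD$|-ADS$", x$NAME)
> # delete them (logical indexing is safe even when nothing matches,
> # whereas x[-integer(0),] would drop every row)
> x <- x[!delete_rows,]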
> x
  CODE                         NAME DATE  DATA1
1 4813             ADVANCED TELECOM 1987  0.013
2 3845 ADVANCED THERAPEUTIC SYS LTD 1987 10.100
3 3845 ADVANCED THERAPEUTIC SYS LTD 1989  2.463
4 3845 ADVANCED THERAPEUTIC SYS LTD 1988  1.563
> # I assume you want to use NAME to check for ranges of data
> date_range <- tapply(x$DATE, x$NAME, function(dates) diff(range(dates)))
> date_range
            ADVANCED TELECOM ADVANCED THERAPEUTIC SYS LTD
                           0                            2
  ADVANCED TISSUE SCI  -CL A
                          NA
> # delete ones with less than 3 years (first-to-last span under 2)
> names_to_delete <- names(date_range[date_range < 2])
> # delete those entries
> x <- x[!(x$NAME %in% names_to_delete),]
> x
  CODE                         NAME DATE  DATA1
2 3845 ADVANCED THERAPEUTIC SYS LTD 1987 10.100
3 3845 ADVANCED THERAPEUTIC SYS LTD 1989  2.463
4 3845 ADVANCED THERAPEUTIC SYS LTD 1988  1.563
>
>
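
For the third part, here is a minimal sketch, assuming the ratio wanted is DATA1 in 1989 divided by DATA1 in 1987 for each remaining company, as the original question states. The data and the CODE, NAME, DATE and DATA1 columns come from the thread; the names `first`, `last`, `ratios` and `RATIO_DATA1` are made up for illustration, and `merge()` is only one of several ways to pair the two years.

```r
# Rebuild the sample data from the thread (self-contained).
x <- read.table(text = "CODE NAME DATE DATA1
4813 'ADVANCED TELECOM' 1987 0.013
3845 'ADVANCED THERAPEUTIC SYS LTD' 1987 10.1
3845 'ADVANCED THERAPEUTIC SYS LTD' 1989 2.463
3845 'ADVANCED THERAPEUTIC SYS LTD' 1988 1.563
2836 'ADVANCED TISSUE SCI  -CL A' 1987 0.847
2836 'ADVANCED TISSUE SCI  -CL A' 1989 0.872
2836 'ADVANCED TISSUE SCI  -CL A' 1988 0.529",
                header = TRUE, stringsAsFactors = FALSE)

# Parts 1 and 2, following the transcript above.
x <- x[!grepl("-CL A$|-OLD$|-ADS$", x$NAME), ]
span <- tapply(x$DATE, x$NAME, function(d) diff(range(d)))
x <- x[x$NAME %in% names(span)[span >= 2], ]

# Part 3 (assumed interpretation): pair each company's 1987 and 1989 rows,
# then divide. merge() appends the suffixes to the shared DATA1 column.
first <- x[x$DATE == 1987, c("CODE", "NAME", "DATA1")]
last  <- x[x$DATE == 1989, c("CODE", "NAME", "DATA1")]
ratios <- merge(first, last, by = c("CODE", "NAME"),
                suffixes = c("_1987", "_1989"))
ratios$RATIO_DATA1 <- ratios$DATA1_1989 / ratios$DATA1_1987
ratios <- ratios[, c("CODE", "NAME", "RATIO_DATA1")]
ratios
```

Because `merge()` keeps only companies that have both a 1987 and a 1989 row, any company missing either year drops out of `ratios`. Ratios for other data variables could be added the same way by carrying those columns through `first` and `last`.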


On Nov 13, 2007 2:34 PM, Jonas Malmros <jonas.malmros at gmail.com> wrote:
> Dear R users,
>
> I have a huge database and I need to adjust it somewhat.
>
> Here is a very small excerpt from the database:
>
> CODE    NAME                            DATE    DATA1
> 4813    ADVANCED TELECOM                1987    0.013
> 3845    ADVANCED THERAPEUTIC SYS LTD    1987    10.1
> 3845    ADVANCED THERAPEUTIC SYS LTD    1989    2.463
> 3845    ADVANCED THERAPEUTIC SYS LTD    1988    1.563
> 2836    ADVANCED TISSUE SCI  -CL A      1987    0.847
> 2836    ADVANCED TISSUE SCI  -CL A      1989    0.872
> 2836    ADVANCED TISSUE SCI  -CL A      1988    0.529
>
> What I need is:
> 1) Delete all cases containing -CL A (and also -OLD, -ADS, etc) at the end
> 2) Delete all cases that have fewer than 3 years of data
> 3) For each remaining case compute the ratio DATA1(1989) / DATA1(1987)
> [and then ratios involving other data variables] and output this into
> a new database consisting of CODE, NAME, RATIOs.
>
> Maybe someone can suggest an effective way to do these things? I
> imagine the first one would involve grep(), and 2 and 3 would involve
> the apply family of functions, but I cannot get my mind around the actual
> code to perform these adjustments. I am new to R; I do write code, but
> it usually consists of for loops and plotting. I would much
> appreciate your help.
> Thank you in advance!
> --
> Jonas Malmros
> Stockholm University
> Stockholm, Sweden
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?


