[R] Subsetting a data frame

jim holtman jholtman at gmail.com
Mon Dec 5 14:10:35 CET 2011


Does this do what you want?

> db <- structure(list(ind = c("ind1", "ind2", "ind3", "ind4"), test1 = c(1,
+ 2, 1.3, 3), test2 = c(56L, 27L, 58L, 2L), test3 = c(1.1, 28,
+ 9, 1.2)), .Names = c("ind", "test1", "test2", "test3"), class =
+ "data.frame", row.names = c(NA,
+ -4L))
>
> terms_include <- c("1","2","3")
> terms_exclude <- c("1.1","1.2","1.3")
>
> f.match <- function(obj, inc, exc){
+     pat <- paste("^(", paste(inc, collapse = "|"), ")", sep = '')
+     patex <- paste(exc, collapse = "|")
+     isMatch <- apply(obj, 1, function(x) any(grepl(pat, x)))
+     notMatch <- !apply(obj, 1, function(x) any(grepl(patex, x)))
+     obj[isMatch & notMatch,]
+ }
>
> db
   ind test1 test2 test3
1 ind1   1.0    56   1.1
2 ind2   2.0    27  28.0
3 ind3   1.3    58   9.0
4 ind4   3.0     2   1.2
> f.match(db, terms_include, terms_exclude)
   ind test1 test2 test3
2 ind2     2    27    28
>
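
One caveat with f.match above, offered as a sketch rather than a definitive fix: apply() on a data frame goes through as.matrix(), which formats numeric columns to a common width, so the 2 in test2 becomes " 2" and cannot match a pattern anchored with ^. The exclusion pattern is also unanchored, and its "." matches any character. The variant below compares unpadded strings with startsWith() (base R 3.3.0 or later) instead of regular expressions. The names starts_with_any and f_prefix, and the default of skipping the "ind" column, are assumptions made for illustration and are not part of the original reply.

```r
# Example data from the thread.
db <- data.frame(
  ind   = c("ind1", "ind2", "ind3", "ind4"),
  test1 = c(1, 2, 1.3, 3),
  test2 = c(56L, 27L, 58L, 2L),
  test3 = c(1.1, 28, 9, 1.2),
  stringsAsFactors = FALSE
)
terms_include <- c("1", "2", "3")
terms_exclude <- c("1.1", "1.2", "1.3")

# as.matrix() pads numeric columns to a common width when the data frame
# also holds character columns, so this value carries a leading space.
padded <- as.matrix(db)[4, "test2"]

# Hypothetical helper: TRUE for each element of x that begins with at
# least one of the prefixes (literal comparison, no regex; NA gives FALSE).
starts_with_any <- function(x, prefixes) {
  Reduce(`|`,
         lapply(prefixes, function(p) !is.na(x) & startsWith(x, p)),
         rep(FALSE, length(x)))
}

# Hypothetical variant of f.match: keep rows where some value starts with
# an include term and no value starts with an exclude term.
# Assumption: the identifier column "ind" should not be searched.
f_prefix <- function(obj, inc, exc, cols = setdiff(names(obj), "ind")) {
  # as.character() per column gives "2", "1.3", "28" without padding.
  vals <- lapply(obj[cols], as.character)
  any_hit <- function(prefixes) {
    Reduce(`|`,
           lapply(vals, starts_with_any, prefixes = prefixes),
           rep(FALSE, nrow(obj)))
  }
  obj[any_hit(inc) & !any_hit(exc), , drop = FALSE]
}

# For this data only the ind2 row is kept, as with f.match above.
res <- f_prefix(db, terms_include, terms_exclude)
print(res)
```

Because the comparison is literal, "1.1" here excludes only values that start with the characters 1.1, whereas the regular expression "1.1" would also match strings such as "121".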

On Mon, Dec 5, 2011 at 6:32 AM, natalie.vanzuydam <nvanzuydam at gmail.com> wrote:
> Hi R users,
>
> I really need help with subsetting data frames:
>
> I have a large database of medical records and I want to be able to match
> patterns from a list of search terms.
>
> I've used this simplified data frame in a previous example:
>
>
> db <- structure(list(ind = c("ind1", "ind2", "ind3", "ind4"), test1 = c(1,
> 2, 1.3, 3), test2 = c(56L, 27L, 58L, 2L), test3 = c(1.1, 28,
> 9, 1.2)), .Names = c("ind", "test1", "test2", "test3"), class =
> "data.frame", row.names = c(NA,
> -4L))
>
> terms_include <- c("1","2","3")
> terms_exclude <- c("1.1","1.2","1.3")
>
>
> So in this example I want to keep every row that contains a term from
> terms_include, as long as no term from terms_exclude occurs in the same
> row of the data frame.
>
> Previously I was given this function which works very well if you want to
> match exactly:
>
>
> f <- function(x)  !any(x %in% terms_exclude) && any(x %in% terms_include)
> db[apply(db[, -1], 1, f), ]
>
>   ind test1 test2 test3
> 2 ind2     2    27  28.0
> 4 ind4     3     2   1.2
>
>
> I would like to know if there is a way to write a similar function that
> looks for values that start with the query string, as in
> grepl("^pattern", x).
>
> I started writing a loop but am not sure how to get it to return the
> data frame or matrix:
>
>
> for (i in 1:length(terms_include)){
> db_new <- apply(db,2, grepl,pattern=i)
> }
>
> Running this loop gives me:
>
> db_new <- structure(c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE,
> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE), .Dim = c(4L,
> 4L), .Dimnames = list(NULL, c("ind", "test1", "test2", "test3"
> )))
>
> So the above is searching for the pattern anywhere in each string instead
> of just at the beginning of the string.
>
> How would I look for the terms to include, but not return a row of the
> data frame if it also contains one of the terms to exclude, while using
> partial matching?
>
> I hope that this makes sense.
>
> Many thanks,
> Natalie
>
> -----
> Natalie Van Zuydam
>
> PhD Student
> University of Dundee
> nvanzuydam at dundee.ac.uk
> --
> View this message in context: http://r.789695.n4.nabble.com/Subsetting-a-data-frame-tp4160127p4160127.html
> Sent from the R help mailing list archive at Nabble.com.



-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.


