[R] Regular Expressions + Matrices

Rui Barradas ruipbarradas at sapo.pt
Fri Aug 10 20:17:33 CEST 2012


Hello,

Try the following.


d <- read.table(textConnection("
ID NAME                          YEAR     SOURCE
1  'New York Mets'               1900      ESPN
2  'New York Yankees'          1920     Cooperstown
3  'Boston Redsox'               1918      ESPN
4  'Washington Nationals'      2010     ESPN
5  'Detroit Tigers'                  1990      ESPN
"), header=TRUE)

d$NAME <- as.character(d$NAME)

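fun <- function(i, x){
     # first word (up to the first whitespace) of the names in rows i and i + 1
     s1 <- unlist(strsplit(x[i, "NAME"], "[[:space:]]"))[1]
     s2 <- unlist(strsplit(x[i + 1, "NAME"], "[[:space:]]"))[1]
     # TRUE only if the IDs differ and the two first words are equal
     x[i, "ID"] != x[i + 1, "ID"] && s1 == s2
}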

inx <- sapply(seq_len(nrow(d) - 1), fun, d)
inx <- c(inx, FALSE) | c(FALSE, inx)
d[inx, ]
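
The same check can also be done without a loop. What follows is only a minimal sketch, assuming the data look like the example above: sub() with a regular expression strips everything from the first whitespace on, and the resulting first words are compared between consecutive rows. The names first, pair, keep, res and out are mine, and the temporary file only stands in for whatever output file you actually want.

```r
# Sample data, same layout as in the example above
d <- read.table(text = "
ID NAME YEAR SOURCE
1 'New York Mets' 1900 ESPN
2 'New York Yankees' 1920 Cooperstown
3 'Boston Redsox' 1918 ESPN
4 'Washington Nationals' 2010 ESPN
5 'Detroit Tigers' 1990 ESPN
", header = TRUE, stringsAsFactors = FALSE)

# First word of each name: drop everything from the first whitespace on
first <- sub("[[:space:]].*$", "", d$NAME)

n <- nrow(d)
# TRUE at position i when rows i and i + 1 share a first word but not an ID
pair <- first[-n] == first[-1] & d$ID[-n] != d$ID[-1]

# Flag both members of every matching pair
keep <- c(pair, FALSE) | c(FALSE, pair)
res <- d[keep, ]
res

# Copy the flagged rows to a new file (a temporary file in this sketch)
out <- tempfile(fileext = ".csv")
write.csv(res, out, row.names = FALSE)
```

Comparing the first words with == requires them to be exactly equal, so a first word such as "New" will not match a name that merely contains it somewhere else.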

Hope this helps,

Rui Barradas
On 10-08-2012 18:41, Fred G wrote:
> Hi all,
>
> My code looks like the following:
> inname = read.csv("ID_error_checker.csv", as.is=TRUE)
> outname = read.csv("output.csv", as.is=TRUE)
>
> #My algorithm is the following:
> #for line in inname
> #if first string up to whitespace in row in inname$name = first string up to whitespace in row + 1 in inname$name
> #AND ID in inname$ID for the top row NOT EQUAL ID in inname$ID for the row below it
> #copy these two lines to a new file
>
> In other words, if the name (up to the first whitespace) in the first row
> equals the name in the second row (etc for whole file) and the ID in the
> first row does not equal the ID in the second row, copy both of these rows
> in full to a new file.  Only caveat is that I want a regular expression not
> to take the full names, but just the first string up to the first
> whitespace in the inname$name column (ie if row1 has a name of: New York
> Mets and row2 has a name of New York Yankees, I would want both of these
> rows to be copied in full since "New" is the same in both...)
>
> Here is some example data:
> ID NAME                          YEAR     SOURCE     NOTES
> 1  New York Mets               1900      ESPN
> 2  New York Yankees          1920     Cooperstown
> 3  Boston Redsox               1918      ESPN
> 4  Washington Nationals      2010     ESPN
> 5  Detroit Tigers                  1990      ESPN
>
> The desired output would be:
> ID   NAME                    YEAR SOURCE
> 1    New York Mets        1900   ESPN
> 2    New York Yankees   1920   Cooperstown
>
> Thanks so much!
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.


