[R] How do I identify non-sequential data?
Lopez, Dan
lopez235 at llnl.gov
Fri Nov 22 01:38:29 CET 2013
Hi Don,
Yes, I am error-checking a dataset produced by a query. The gaps are most likely caused by a problem with the query, but I wanted to assess the problem first.
BTW, Arun provided another solution, which is similar to yours but uses the function ave() instead:
testSeq[!!(with(testSeq, ave(YoS, ID, FUN = function(x) any(c(0, diff(x)) > 1)))), ]
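For reference, the same ave() idiom can also build the CheckVal column asked about in the original question. This is only a minimal sketch, assuming the posted testSeq data; the names CheckVal2, gapIDs and nonSeq are made up here for illustration, and the 1e-8 tolerance is an assumption added to guard against floating-point rounding in the differences.

```r
# Posted data (CheckVal and dept are left out; they are not needed here)
testSeq <- data.frame(
  ID   = c(12, 12, 44, 44, 44, 33, 33, 16, 16, 16, 16),
  Year = c(2010, 2011, 2009, 2010, 2011, 2009, 2010, 2009, 2010, 2011, 2012),
  YoS  = c(1.1, 2.1, 1.4, 2.4, 3.4, 2.3, 4.4, 1.6, 2.6, 5.6, 6.6)
)

# Current-row YoS minus previous-row YoS within each ID;
# the first row of each ID gets 0
testSeq$CheckVal2 <- with(testSeq, ave(YoS, ID, FUN = function(x) c(0, diff(x))))

# IDs with at least one jump larger than 1
# (the small tolerance is an assumption, not part of the thread)
gapIDs <- unique(testSeq$ID[testSeq$CheckVal2 > 1 + 1e-8])

# All rows belonging to those IDs
nonSeq <- testSeq[testSeq$ID %in% gapIDs, ]
print(nonSeq)
```

Keeping the per-row differences in a column makes it easy to see where each jump occurs before filtering, which may help when tracing the problem back to the query.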
I appreciate your response on this.
Dan
-----Original Message-----
From: MacQueen, Don
Sent: Thursday, November 21, 2013 3:58 PM
To: Lopez, Dan; R help (r-help at r-project.org)
Subject: Re: [R] How do I identify non-sequential data?
Dan,
Does this do it?
## where dt is the data
tmp <- split(dt, dt$ID)
foo <- lapply(tmp, function(x) any(diff(x$YoS) > 1))
foo <- data.frame( ID=names(foo), gap=unlist(foo))
Note that I ignored dept.
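The code above flags IDs but does not pull out their rows. One possible follow-up is sketched below, assuming dt holds the posted testSeq data; badIDs and nonSeq are illustrative names that do not appear in the thread, vapply() is swapped in for lapply() so the result is a plain logical vector, and the 1e-8 tolerance is an added assumption to avoid floating-point false positives.

```r
# Posted data, assumed to be what dt refers to
dt <- data.frame(
  ID   = c(12, 12, 44, 44, 44, 33, 33, 16, 16, 16, 16),
  Year = c(2010, 2011, 2009, 2010, 2011, 2009, 2010, 2009, 2010, 2011, 2012),
  YoS  = c(1.1, 2.1, 1.4, 2.4, 3.4, 2.3, 4.4, 1.6, 2.6, 5.6, 6.6)
)

# One data frame per ID
tmp <- split(dt, dt$ID)

# TRUE for an ID if any consecutive YoS difference exceeds 1
# (tolerance is an assumption, not part of the thread)
gap <- vapply(tmp, function(x) any(diff(x$YoS) > 1 + 1e-8), logical(1))

# split() names its pieces by ID, so the flagged names are the flagged IDs
badIDs <- names(gap)[gap]

# Keep every row whose ID was flagged
nonSeq <- dt[as.character(dt$ID) %in% badIDs, ]
print(nonSeq)
```

An ID with a single row yields an empty diff(), so any() returns FALSE and that ID is simply not flagged.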
Little hard to see how YoS can increase by more than one when the year increases by only one ... unless this is a search for erroneous data.
-Don
--
Don MacQueen
Lawrence Livermore National Laboratory
7000 East Ave., L-627
Livermore, CA 94550
925-423-1062
On 11/21/13 3:32 PM, "Lopez, Dan" <lopez235 at llnl.gov> wrote:
>Hi R Experts,
>
>About the data:
>My data consists of people (ID) with years of service (YoS) for each
>year. An ID can appear multiple times.
>The data is sorted by ID then by Year.
>
>Problem:
>I need to extract ID data with non-sequential YoS rows. For example
>below that would be all rows for ID 33 and 16 since they have a
>non-sequential YoS.
>To accomplish this I figured I could create a column called 'CheckVal'
>that takes current row YoS minus previous row YoS. The first instance
>for each ID will be 0. 'CheckVal' in the below data set was created in Excel.
>I want to know how to do this in R.
>Is there a package I can use or specific function or set of functions I
>can use to accomplish this?
>
>#My data looks like:
>> testSeq
>   ID Year YoS CheckVal dept
>1  12 2010 1.1      0.0    A
>2  12 2011 2.1      1.0    A
>3  44 2009 1.4      0.0    C
>4  44 2010 2.4      1.0    C
>5  44 2011 3.4      1.0    B
>6  33 2009 2.3      0.0    A
>7  33 2010 4.4      2.1    A
>8  16 2009 1.6      0.0    B
>9  16 2010 2.6      1.0    B
>10 16 2011 5.6      3.0    C
>11 16 2012 6.6      1.0    A
>
>
>#here is dput of data for R
>structure(list(ID = c(12, 12, 44, 44, 44, 33, 33, 16, 16, 16,
>16), Year = c(2010, 2011, 2009, 2010, 2011, 2009, 2010, 2009,
>2010, 2011, 2012), YoS = c(1.1, 2.1, 1.4, 2.4, 3.4, 2.3, 4.4,
>1.6, 2.6, 5.6, 6.6), CheckVal = c(0, 1, 0, 1, 1, 0, 2.1, 0, 1,
>3, 1), dept = structure(c(1L, 1L, 3L, 3L, 2L, 1L, 1L, 2L, 2L,
>3L, 1L), .Label = c("A", "B", "C"), class = "factor")), .Names = c("ID",
>"Year", "YoS", "CheckVal", "dept"), row.names = c(NA, 11L), class =
>"data.frame")
>
>Dan
>Workforce Analyst
>LLNL