[R] How do I identify non-sequential data?

Lopez, Dan lopez235 at llnl.gov
Fri Nov 22 01:38:29 CET 2013


Hi Don,

Yes, I am error-checking a dataset produced by a query. It is most likely a problem with the query, but I wanted to assess the problem first.

BTW Arun provided another solution which is similar to yours but uses the function ave instead:
 testSeq[!!(with(testSeq,ave(YoS,ID,FUN=function(x) any(c(0,diff(x))>1)))),]
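
Unpacked into steps, the one-liner above might read as in the sketch below. It assumes testSeq has the ID and YoS columns from the original post and is already sorted by ID and Year. The names gap_flag and bad_rows and the small tolerance tol are illustrative additions, not part of Arun's code.

```r
# Cut-down copy of the posted data: only the columns the check needs.
testSeq <- data.frame(
  ID  = c(12, 12, 44, 44, 44, 33, 33, 16, 16, 16, 16),
  YoS = c(1.1, 2.1, 1.4, 2.4, 3.4, 2.3, 4.4, 1.6, 2.6, 5.6, 6.6)
)

# Assumption: a small tolerance so floating-point noise in diff() is not
# mistaken for a gap (the original one-liner compares against 1 directly).
tol <- 1e-8

# ave() applies FUN within each ID and recycles the single TRUE/FALSE over
# that ID's rows. The result is numeric (1/0) because YoS is numeric.
gap_flag <- ave(testSeq$YoS, testSeq$ID,
                FUN = function(x) any(diff(x) > 1 + tol))

# Keep every row of any ID that has at least one gap.
bad_rows <- testSeq[gap_flag == 1, ]
print(bad_rows)
```

Dropping the leading 0 from c(0, diff(x)) does not change the result here, since a 0 can never exceed 1.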

I appreciate your response on this.
Dan


-----Original Message-----
From: MacQueen, Don 
Sent: Thursday, November 21, 2013 3:58 PM
To: Lopez, Dan; R help (r-help at r-project.org)
Subject: Re: [R] How do I identify non-sequential data?

Dan,
Does this do it?

## where dt is the data

tmp <- split(dt, dt$ID)

foo <- lapply(tmp, function(x) any(diff(x$YoS) > 1))

foo <- data.frame( ID=names(foo), gap=unlist(foo))

Note that I ignored dept.
Little hard to see how YoS can increase by more than one when the year increases by only one ... unless this is a search for erroneous data.
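
To go from the per-ID flags to the rows asked for in the original post, one possible continuation is sketched here. It rebuilds a cut-down dt (ID and YoS only) so that it runs on its own. sapply is swapped in for lapply plus unlist, and has_gap, gap_ids, gap_rows and the tolerance tol are names assumed for illustration.

```r
# Cut-down copy of the posted data, assumed sorted by ID then Year.
dt <- data.frame(
  ID  = c(12, 12, 44, 44, 44, 33, 33, 16, 16, 16, 16),
  YoS = c(1.1, 2.1, 1.4, 2.4, 3.4, 2.3, 4.4, 1.6, 2.6, 5.6, 6.6)
)

# Assumption: guard against floating-point noise when comparing with 1.
tol <- 1e-8

# One TRUE/FALSE per ID, named by ID.
has_gap <- sapply(split(dt$YoS, dt$ID),
                  function(y) any(diff(y) > 1 + tol))

# IDs with at least one jump larger than 1 (names are character strings).
gap_ids <- names(has_gap)[has_gap]

# All rows belonging to those IDs, in their original order.
gap_rows <- dt[as.character(dt$ID) %in% gap_ids, ]
print(gap_rows)
```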

-Don



--
Don MacQueen

Lawrence Livermore National Laboratory
7000 East Ave., L-627
Livermore, CA 94550
925-423-1062





On 11/21/13 3:32 PM, "Lopez, Dan" <lopez235 at llnl.gov> wrote:

>Hi R Experts,
>
>About the data:
>My data consists of people (ID) with years of service (YoS) for each 
>year. An ID can appear multiple times.
>The data is sorted by ID then by Year.
>
>Problem:
>I need to extract all rows for any ID whose YoS values are
>non-sequential. In the example below, that would be all rows for IDs 33
>and 16, since each has a non-sequential YoS.
>To accomplish this I figured I could create a column called 'CheckVal'
>that takes current row YoS minus previous row YoS. The first instance 
>for each ID will be 0. 'CheckVal' in the below data set was created in Excel.
>I want to know how to do this in R.
>Is there a package I can use or specific function or set of functions I 
>can use to accomplish this?
>
>#My data looks like:
>> testSeq
>   ID Year YoS CheckVal dept
>1  12 2010 1.1      0.0    A
>2  12 2011 2.1      1.0    A
>3  44 2009 1.4      0.0    C
>4  44 2010 2.4      1.0    C
>5  44 2011 3.4      1.0    B
>6  33 2009 2.3      0.0    A
>7  33 2010 4.4      2.1    A
>8  16 2009 1.6      0.0    B
>9  16 2010 2.6      1.0    B
>10 16 2011 5.6      3.0    C
>11 16 2012 6.6      1.0    A
>
>#here is dput of data for R
>
>structure(list(ID = c(12, 12, 44, 44, 44, 33, 33, 16, 16, 16,
>16), Year = c(2010, 2011, 2009, 2010, 2011, 2009, 2010, 2009,
>2010, 2011, 2012), YoS = c(1.1, 2.1, 1.4, 2.4, 3.4, 2.3, 4.4,
>1.6, 2.6, 5.6, 6.6), CheckVal = c(0, 1, 0, 1, 1, 0, 2.1, 0, 1,
>3, 1), dept = structure(c(1L, 1L, 3L, 3L, 2L, 1L, 1L, 2L, 2L,
>3L, 1L), .Label = c("A", "B", "C"), class = "factor")), .Names = c("ID",
>"Year", "YoS", "CheckVal", "dept"), row.names = c(NA, 11L), class =
>"data.frame")
>
>Dan
>Workforce Analyst
>LLNL
>