[R] [r] How to pick colums from a ragged array?

Tue Oct 23 14:51:03 CEST 2012

Hi

> -----Original Message-----
> From: Stuart Leask [mailto:Stuart.Leask at nottingham.ac.uk]
> Sent: Tuesday, October 23, 2012 2:29 PM
> To: PIKAL Petr; r-help at r-project.org
> Subject: RE: [r] How to pick colums from a ragged array?
> 
> Hi there.
> 
> Not sure I follow what you are doing.
> 
> I want a list of all the IDs that have duplicate DATE entries, only
> when the DATE is the earliest (or last) date for that ID.

And that is what the function (with 3 small modifications) does

fff<-function(data, first=TRUE, remove=FALSE) {

testfirst <- function(x) x[1,2]==x[2,2]
testlast <- function(x) x[nrow(x),2]==x[nrow(x)-1,2]

if(first) sel <- as.numeric(names(which(unlist(sapply(split(data, data[,1]), testfirst))))) else
sel <- as.numeric(names(which(unlist(sapply(split(data, data[,1]), testlast)))))

if (remove) data[!data[,1] %in% sel,] else data[data[,1] %in% sel,]
}

See the result of your refined data

fff(id.d)
     ID       DATE
5   167 2004-02-05
6   167 2004-02-05
22  841 2005-04-21
23  841 2005-04-21
24  841 2006-04-28
25  841 2006-06-02
26  841 2006-08-16
27  841 2006-10-25
28  841 2006-11-29
29  841 2007-01-12
30  841 2007-05-14
38 1019 1987-05-08
39 1019 1987-05-08
40 1019 1988-03-30
> fff(id.d, first=F)
   ID       DATE
5 167 2004-02-05
6 167 2004-02-05
> fff(id.d, remove=T)
    ID       DATE
1   58 2006-08-21
2   58 2006-12-07
3   58 2008-01-02
4   58 2009-09-04
7  323 2005-11-11
8  323 2006-01-11
9  323 2007-11-19
10 323 2008-01-07
11 323 2008-04-07
12 323 2008-05-21
13 323 2008-07-11
14 547 2004-10-05
15 794 2007-09-05
16 814 2002-08-14
17 814 2002-11-25
18 814 2004-04-29
19 814 2004-04-29
20 814 2007-12-05
21 814 2008-02-27
31 910 1987-05-08
32 910 2004-02-05
33 910 2004-02-05
34 910 2009-11-20
35 910 2009-12-10
36 910 2009-12-24
37 999 2005-05-03
>

You can do surgery on fff function to see what result comes from some piece of the function e.g.

sapply(split(id.d, id.d[,1]), testlast)

Regards
Petr

> 
> I have refined my test dataset, to include some tests (e.g. 910 has the
> same dup as 1019, but for 910 it's not the earliest date):
> 
> 
> ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
> ,547,794,814,814,814,814,814,814,841,841,841,841,841
> ,841,841,841,841,910,910,910,910,910,910,999,1019,1019
> ,1019)
> 
> DATE <-
>  c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
>  ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
>  ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
>  ,20050421,20050421,20060428,20060602,20060816,20061025,20061129
>  ,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210
>  ,20091224,20050503,19870508,19870508,19880330)
> 
> Correct output:
> "167"  "841"  "1019"
> 
> Stuart
> 
> -----Original Message-----
> From: PIKAL Petr [mailto:petr.pikal at precheza.cz]
> Sent: 23 October 2012 13:15
> To: Stuart Leask; r-help at r-project.org
> Subject: RE: [r] How to pick colums from a ragged array?
> 
> Hi
> 
> Rui's answer brought me to more elaborated solution which still needs
> data frame to be ordered by date
> 
> fff<-function(data, first=TRUE, remove=FALSE) {
> 
> testfirst <- function(x) x[1,2]==x[2,2]
> testlast <- function(x) x[length(x),2]==x[length(x)-1,2]
> 
> if(first) sel <- as.numeric(names(which(sapply(split(data, data[,1]),
> testfirst)))) else sel <- as.numeric(names(which(sapply(split(data,
> data[,1]), testlast))))
> 
> if (remove) data[data[,1]!=sel,] else data[data[,1]==sel,] }
> 
> 
> > fff(id.d)
>     ID     DATE
> 31 910 20091105
> 32 910 20091105
> 33 910 20091117
> 34 910 20091119
> 35 910 20091120
> 36 910 20091210
> 37 910 20091224
> 38 910 20091224
> 
> > fff(id.d, remove=T)
>      ID     DATE
> 1    58 20060821
> 2    58 20061207
> 3    58 20080102
> 4    58 20090904
> 5   167 20040205
> 6   167 20040323
> 7   323 20051111
> 8   323 20060111
> 9   323 20071119
> 10  323 20080107
> 11  323 20080407
> 12  323 20080521
> 13  323 20080711
> 14  547 20041005
> 15  794 20070905
> 16  814 20020814
> 17  814 20021125
> 18  814 20040429
> 19  814 20040429
> 20  814 20071205
> 21  814 20080227
> 22  841 20050421
> 23  841 20060130
> 24  841 20060428
> 25  841 20060602
> 26  841 20060816
> 27  841 20061025
> 28  841 20061129
> 29  841 20070112
> 30  841 20070514
> 39  999 20050503
> 40 1019 19870508
> 41 1019 19880223
> 42 1019 19880330
> 43 1019 19880330
> >
> 
> Regards
> Petr
> 
> 
> > -----Original Message-----
> > From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-
> > project.org] On Behalf Of PIKAL Petr
> > Sent: Tuesday, October 23, 2012 1:49 PM
> > To: Stuart Leask; r-help at r-project.org
> > Subject: Re: [R] [r] How to pick colums from a ragged array?
> >
> > Hi
> >
> > I did not check your code and rather followed your explanation. BTW,
> > thanks for test data.
> >
> > small change in data frame to make DATE as Date class
> >
> > datum<-as.Date(as.character(DATE), format="%Y%m%d") id.d <-
> > data.frame(ID,datum )
> >
> > ordering by date
> >
> > id.d<-id.d[order(id.d$datum),]
> >
> >
> > two functions to test if first two dates are the same or last two
> > dates are the same
> >
> > testfirst <- function(x) x[1,2]==x[2,2] testlast <- function(x)
> > x[length(x),2]==x[length(x)-1,2]
> >
> > change one last date in the data frame to be the same as previous
> >
> > id.d[35,2]<-id.d[36,2]
> >
> > and here are results
> >
> > sapply(split(id.d, id.d$ID), testlast)
> >    58   167   323   547   794   814   841   910   999  1019
> > FALSE FALSE FALSE    NA    NA FALSE FALSE  TRUE    NA FALSE
> >
> > > sapply(split(id.d, id.d$ID), testfirst)
> >    58   167   323   547   794   814   841   910   999  1019
> > FALSE FALSE FALSE    NA    NA FALSE FALSE FALSE    NA FALSE
> >
> > Now you can select ID which is true and remove it from your data
> > which(sapply(split(id.d, id.d$ID), testlast))
> >
> > and use it for your data frame to subset/remove id.d$ID ==
> > as.numeric(names(which(sapply(split(id.d, id.d$ID), testlast))))  [1]
> > FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> > FALSE [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> FALSE
> > FALSE FALSE [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> FALSE
> > FALSE TRUE  TRUE [37]  TRUE  TRUE  TRUE  TRUE
> >
> > However I am not sure if this is exactly what you want.
> >
> > Regards
> > Petr
> >
> > > -----Original Message-----
> > > From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-
> > > project.org] On Behalf Of Stuart Leask
> > > Sent: Tuesday, October 23, 2012 11:38 AM
> > > To: r-help at r-project.org
> > > Subject: [R] [r] How to pick colums from a ragged array?
> > >
> > > I have a large dataset (~1 million rows) of three variables: ID
> > > (patient's name), DATE (of appointment) and DIAGNOSIS (given on
> that
> > > date).
> > > Patients may have been assigned more than one diagnosis at any one
> > > appointment - leading to two rows, same ID and DATE but different
> > > DIAGNOSIS.
> > > The diagnoses may change between appointments.
> > >
> > > I want to subset the data in two ways:
> > >
> > > -          define groups of patients by the first diagnosis given
> > >
> > > -          define groups of patients by the last diagnosis given.
> > >
> > > The problem:
> > > Unfortunately, a small number of patients have been given more than
> > > one diagnosis at their first (or last) appointment. These
> > > individuals I need to identify and remove, as it's not possible to
> > > say uniquely what their first (or last) diagnosis was. So I need to
> > > identify and remove these individuals which have pairs of rows with
> > > the same ID
> > and
> > > (lowest or highest) DATE. The size of the dataset precludes the
> > option
> > > of doing this by eye.
> > >
> > > I suspect there is a very elegant way of doing this in R.
> > >
> > > This is what I've come up with:
> > >
> > >
> > > -          Sort by DATE then ID
> > >
> > > -          Make a ragged array of DATE by ID
> > >
> > > -          Remove IDs that only occur once.
> > >
> > > -          Subtract the first and second DATEs. Remove IDs for
> which
> > > this = zero, as this will only be true for IDs for which the
> > > appointment is recorded twice (because there were two diagnoses
> > > recorded on this date).
> > >
> > > -          (Then do the same to get the 'last appointment'
> > duplicates,
> > > by reversing the initial sort by DATE.)
> > >
> > > I am stuck at the 'Subtract dates' step: I would like to get the
> > > data out of the ragged array by columns (so e.g. I end up with a
> > > matrix of ID, 1st DATE, 2nd DATE). But I can't get the dates out by
> > > column from the ragged array.
> > >
> > > I hope someone can help. My ugly code is below, with some data for
> > > testing.
> > >
> > >
> > > Stuart
> > >
> > >
> > > Dr Stuart John Leask DM FRCPsych MB BChir MA Clinical Senior
> > > Lecturer and Honorary Consultant Pychiatrist Institute of Mental
> > > Health, Innovation Park Triumph Road, Nottingham, Notts. NG7 2TU.
> UK
> > > Tel. +44
> > > 115 82 30419
> > > stuart.leask at nottingham.ac.uk<mailto:stuart.leask at nottingham.ac.uk>
> > > Google 'Dr Stuart Leask'
> > >
> > >
> > > ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
> > > ,547,794,814,814,814,814,814,814,841,841,841,841,841
> > > ,841,841,841,841,910,910,910,910,910,910,999,1019,1019
> > > ,1019)
> > >
> > > DATE <-
> > > c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
> > > ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
> > > ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
> > > ,20050421,20060130,20060428,20060602,20060816,20061025,20061129
> > > ,20070112,20070514,20091105,20091117,20091119,20091120,20091210
> > > ,20091224,20050503,19870508,19880223,19880330)
> > >
> > > id.d <- cbind (ID,DATE )
> > > rag.a  <-  split ( id.d [ ,2 ], id.d [ ,1])               # create
> > > ragged array, 1-n DATES for every NAME
> > >
> > > # Inelegant attempt to remove IDs that only have one entry:
> > >
> > > rag.s <-tapply  (id.d [ ,2], id.d [ ,1], sum)             #add up
> the
> > > dates per row
> > > # Since DATE is in 'year mo da', if there's only one date, sum will
> > be
> > > less than 2100000:
> > > rag.t <- rag.s [ rag.s > 21000000 ]
> > > multi.dates <- rownames ( rag.t )                         # all the
> > IDs
> > > with >1 date
> > > rag.am <- rag.a [ multi.dates ]                           # rag.am
> > only
> > > has IDs with > 1 Date
> > >
> > >
> > > # But now I'm stuck.
> > > # Each row of the array is rag.am$ID.
> > > # So I can't pick columns of DATEs from the ragged array.
> > >
> > > This message and any attachment are intended solely for the
> > > addressee and may contain confidential information. If you have
> > > received this message in error, please send it back to me, and
> > > immediately delete
> > it.
> > > Please do not use, copy or disclose the information contained in
> > > this message or in any attachment.  Any views or opinions expressed
> > > by the author of this email do not necessarily reflect the views of
> > > the University of Nottingham.
> > >
> > > This message has been checked for viruses but the contents of an
> > > attachment may still contain software viruses which could damage
> > > your computer system:
> > > you are advised to perform your own checks. Email communications
> > > with the University of Nottingham may be monitored as permitted by
> > > UK legislation.
> > > 	[[alternative HTML version deleted]]
> > >
> > > ______________________________________________
> > > R-help at r-project.org mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > PLEASE do read the posting guide http://www.R-project.org/posting-
> > > guide.html and provide commented, minimal, self-contained,
> > > reproducible code.
> >
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-
> > guide.html and provide commented, minimal, self-contained,
> > reproducible code.