[R] [r] How to pick colums from a ragged array?

Wed Oct 24 19:38:12 CEST 2012

Hi Stuart,

Just a small comment:

id.d1$id.d_INCLUDE<-TRUE #is not needed.

res1 <- data.frame(flag = tapply(id.d[, 2], id.d[, 1], FUN = function(x) {
  dup <- duplicated(x) | duplicated(x, fromLast = TRUE)
  head(dup, 1) | tail(dup, 1)
}))
res2 <- id.d[id.d[, 1] %in% names(res1[res1$flag == TRUE, ]) &
             (duplicated(id.d[, 1:2]) | duplicated(id.d[, 1:2], fromLast = TRUE)), ]
res3 <- res2[!res2$ID %in% res2[duplicated(res2) | duplicated(res2, fromLast = TRUE), ]$ID, ]
id.d1<-id.d
bad<-id.d1[id.d1$ID%in%res3$ID,]
bad$INCLUDE<-FALSE
res4<-merge(id.d1,bad,all=TRUE)
res4$INCLUDE[is.na(res4$INCLUDE)]<-TRUE
tail(res4)
#     ID     DATE DG INCLUDE
#35  910 20080521  4    TRUE
#36  910 20091224  2    TRUE
#37  999 20050503  2    TRUE
#38 1019 19870508  1   FALSE
#39 1019 19870508  2   FALSE
#40 1019 19880330  1   FALSE
A.K.

----- Original Message -----
From: Stuart Leask <Stuart.Leask at nottingham.ac.uk>
To: "arun (smartpink111 at yahoo.com)" <smartpink111 at yahoo.com>; PIKAL Petr <petr.pikal at precheza.cz>; "Rui Barradas (ruipbarradas at sapo.pt)" <ruipbarradas at sapo.pt>
Cc: 
Sent: Wednesday, October 24, 2012 11:41 AM
Subject: RE: [r] How to pick colums from a ragged array?

(And, considering  the real application, the functions ideally should probably output a variable INCLUDE, the same length as the original data, with TRUE and FALSE for whether or not that row should be included...)

-----Original Message-----
From: Leask Stuart
Sent: 24 October 2012 16:25
To: arun (smartpink111 at yahoo.com); 'PIKAL Petr'; Rui Barradas (ruipbarradas at sapo.pt)
Subject: RE: [r] How to pick colums from a ragged array?

Arun, Petr, Rui, many thanks for your help, and the functions you have written.

You'll recall I wanted to remove these first (or last) duplicates, because they represented instances where two different diagnoses (in this case, variable DG, value 1, 2, 3, 4 or 5) had been recorded on the same day - so I can't say which was 'first' (or 'last').

Your functions have revealed something I wasn't expecting: In some cases, the diagnoses recorded on the duplicated DATEs are the same!
This is a surprise to me, but probably reflects someone going to two different departments in a clinic, and both departments submit data. I have to say that crazy things like this are often a feature of real data, which I'm sure you've come across yourselves.

Of course, I don't want to remove records in which I can determine an unambiguous 'first diagnosis'.

You have all put in so much effort on my behalf, I'm ashamed to ask, but I wonder if any of the functions you've written could do this with a little more indexing and the 'duplicated' function.
So the function should only exclude an ID if, having identified a first (or last) DATE duplicate, the DGs recorded on that date are different.
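
One way this rule could be expressed is sketched below. It assumes a data frame with columns ID, DATE and DG as in the test data; the helper name flag_include and the toy values are illustrative only and are not one of the functions posted in this thread.

```r
# Toy data (hypothetical): ID 1 has two different DGs on its first date,
# ID 2 has the same DG twice on its first date, ID 3 has a single row.
toy <- data.frame(
  ID   = c(1, 1, 1, 2, 2, 2, 3),
  DATE = c(20040205, 20040205, 20050101, 20040205, 20040205, 20060101, 20050503),
  DG   = c(1, 2, 1, 4, 4, 3, 2)
)

# flag_include() is an illustrative helper, not a function from the thread.
# It returns a logical vector with one element per row of `data`:
# FALSE for every row of an ID whose earliest (first = TRUE) or latest
# (first = FALSE) DATE carries more than one distinct DG, TRUE otherwise.
flag_include <- function(data, first = TRUE) {
  pick <- if (first) min else max
  # boundary date of each ID, repeated for every row of that ID
  edge <- ave(data$DATE, data$ID, FUN = pick)
  at_edge <- data$DATE == edge
  # number of distinct DG values recorded on the boundary date, per ID
  n_dg <- tapply(data$DG[at_edge], data$ID[at_edge],
                 function(x) length(unique(x)))
  ambiguous <- names(n_dg)[n_dg > 1]
  !(as.character(data$ID) %in% ambiguous)
}

toy$INCLUDE <- flag_include(toy)
toy
```

In the toy data only ID 1 is excluded: ID 2 also has a duplicated first DATE, but the DG is the same on both rows, so its first diagnosis is unambiguous. The data do not need to be sorted, because the boundary date is found with min() or max() per ID.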

Test dataset:

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)

DATE <-
c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
,20060111,20071119,20080107,20080407,20080521,20080521,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20071205
,20050421,20050421,20060428,20060602,20060816,20061025,20061129
,20070112,20070514, 19870508,20040205,20040205, 20080521,20080521
,20091224,20050503,19870508,19870508,19880330)

DG<-
c(1,2,1,1,4,4,3,2,3,2,1,2,3,2,1,2,2,2,2,2,2,1,2,1,1,1,1,1,1,4,3,3,3,4,3,2,2,2,1,1)

id.d<-data.frame(ID,DATE,DG)
id.d

# Considering Rui's getRepeat function:

g.r<-getRepeat(id.d)    # defaults to first = TRUE; use getRepeat(id.d, first = FALSE) to get the last ones
g.rr<-do.call(rbind, g.r) # put the data into a matrix

# I can remove the date duplicates with:
g.rr[rep(!duplicated(g.rr)[(1:(dim(g.rr)[1]/2))*2],each=2),]

I'm not sure how to add this to your suggestions, Arun & Petr...

Stuart

-----Original Message-----
From: PIKAL Petr [mailto:petr.pikal at precheza.cz]
Sent: 23 October 2012 15:24
To: Stuart Leask
Subject: RE: [r] How to pick colums from a ragged array?

Hi

I assumed that id.d is a data frame

id.d <- data.frame (ID,DATE )

and

fff(id.d)

works for me
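
The "incorrect number of dimensions" error is consistent with how split() treats a matrix built by cbind(): the matrix is split as a plain vector, so the pieces have no rows or columns left to index. A minimal sketch with made-up values (not the thread's data):

```r
ID   <- c(58, 58, 167, 167)
DATE <- c(20060821, 20061207, 20040205, 20040205)

m  <- cbind(ID, DATE)       # numeric matrix
df <- data.frame(ID, DATE)  # data frame

# split() on a matrix treats it as a plain vector, so each piece has
# lost its dimensions and x[1, 2] cannot work on it
pieces_m <- split(m, m[, 1])
is.null(dim(pieces_m[[1]]))

# split() on a data frame returns one data frame per ID,
# so row/column indexing inside each piece is valid
pieces_df <- split(df, df[, 1])
sapply(pieces_df, function(x) x[1, 2] == x[2, 2])
```

Converting with as.data.frame(id.d) before calling the function should have the same effect as building the data frame directly.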

Petr

> -----Original Message-----
> From: Stuart Leask [mailto:Stuart.Leask at nottingham.ac.uk]
> Sent: Tuesday, October 23, 2012 3:13 PM
> To: PIKAL Petr
> Subject: RE: [r] How to pick colums from a ragged array?
>
> Hi Petr.
> I see what you mean it should do, but when I run it I get an error
> (see below).
> Stuart
>
>
> > ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
> + ,547,794,814,814,814,814,814,814,841,841,841,841,841
> + ,841,841,841,841,910,910,910,910,910,910,999,1019,1019
> + ,1019)
> >
> > DATE <-
> +  c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
> +  ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
> +  ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
> +  ,20050421,20050421,20060428,20060602,20060816,20061025,20061129
> +  ,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210
> +  ,20091224,20050503,19870508,19870508,19880330)
> >
> >  id.d <- cbind (ID,DATE )
> > fff<-function(data, first=TRUE, remove=FALSE) {
> +
> + testfirst <- function(x) x[1,2]==x[2,2]
> + testlast <- function(x) x[nrow(x),2]==x[nrow(x)-1,2]
> +
> + if(first) sel <- as.numeric(names(which(unlist(sapply(split(data,
> + data[,1]), testfirst))))) else sel <-
> + as.numeric(names(which(unlist(sapply(split(data, data[,1]),
> + testlast)))))
> +
> + if (remove) data[!data[,1] %in% sel,] else data[data[,1] %in% sel,]
> + }
> >
> > fff(id.d)
> Error in x[1, 2] : incorrect number of dimensions
> >
>
> -----Original Message-----
> From: PIKAL Petr [mailto:petr.pikal at precheza.cz]
> Sent: 23 October 2012 13:51
> To: Stuart Leask; r-help at r-project.org
> Subject: RE: [r] How to pick colums from a ragged array?
>
> Hi
>
> > -----Original Message-----
> > From: Stuart Leask [mailto:Stuart.Leask at nottingham.ac.uk]
> > Sent: Tuesday, October 23, 2012 2:29 PM
> > To: PIKAL Petr; r-help at r-project.org
> > Subject: RE: [r] How to pick colums from a ragged array?
> >
> > Hi there.
> >
> > Not sure I follow what you are doing.
> >
> > I want a list of all the IDs that have duplicate DATE entries, only
> > when the DATE is the earliest (or last) date for that ID.
>
> And that is what the function (with 3 small modifications) does
>
>
> fff <- function(data, first=TRUE, remove=FALSE) {
>
>   testfirst <- function(x) x[1,2]==x[2,2]
>   testlast <- function(x) x[nrow(x),2]==x[nrow(x)-1,2]
>
>   if (first) {
>     sel <- as.numeric(names(which(unlist(sapply(split(data, data[,1]), testfirst)))))
>   } else {
>     sel <- as.numeric(names(which(unlist(sapply(split(data, data[,1]), testlast)))))
>   }
>
>   if (remove) data[!data[,1] %in% sel,] else data[data[,1] %in% sel,]
> }
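
Compared with the earlier version quoted further down, the changes appear to be nrow(x) in place of length(x), an unlist() wrapped around sapply(), and %in% in place of == and !=. A minimal sketch of why the first and last of these matter, using made-up data and a hypothetical selection vector sel:

```r
d <- data.frame(ID   = c(1, 1, 2, 2, 3, 3, 3, 3),
                DATE = c(10, 20, 5, 5, 7, 7, 8, 9))

# length() of a data frame counts its columns, nrow() counts its rows,
# so x[length(x), 2] picks row 2 here rather than the last row
length(d)
nrow(d)

sel <- c(2, 3)  # several selected IDs (hypothetical)

# `==` recycles sel along the ID column, so a match depends on row position
d$ID == sel

# %in% tests each ID for membership in sel, independent of position
d$ID %in% sel
```

With a single selected ID the two comparisons agree, which may be why the earlier version looked correct on the first test data.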
>
> See the result of your refined data
>
> fff(id.d)
>      ID       DATE
> 5   167 2004-02-05
> 6   167 2004-02-05
> 22  841 2005-04-21
> 23  841 2005-04-21
> 24  841 2006-04-28
> 25  841 2006-06-02
> 26  841 2006-08-16
> 27  841 2006-10-25
> 28  841 2006-11-29
> 29  841 2007-01-12
> 30  841 2007-05-14
> 38 1019 1987-05-08
> 39 1019 1987-05-08
> 40 1019 1988-03-30
> > fff(id.d, first=F)
>    ID       DATE
> 5 167 2004-02-05
> 6 167 2004-02-05
> > fff(id.d, remove=T)
>     ID       DATE
> 1   58 2006-08-21
> 2   58 2006-12-07
> 3   58 2008-01-02
> 4   58 2009-09-04
> 7  323 2005-11-11
> 8  323 2006-01-11
> 9  323 2007-11-19
> 10 323 2008-01-07
> 11 323 2008-04-07
> 12 323 2008-05-21
> 13 323 2008-07-11
> 14 547 2004-10-05
> 15 794 2007-09-05
> 16 814 2002-08-14
> 17 814 2002-11-25
> 18 814 2004-04-29
> 19 814 2004-04-29
> 20 814 2007-12-05
> 21 814 2008-02-27
> 31 910 1987-05-08
> 32 910 2004-02-05
> 33 910 2004-02-05
> 34 910 2009-11-20
> 35 910 2009-12-10
> 36 910 2009-12-24
> 37 999 2005-05-03
> >
>
> You can do surgery on fff function to see what result comes from some
> piece of the function e.g.
>
> sapply(split(id.d, id.d[,1]), testlast)
>
> Regards
> Petr
>
> >
> > I have refined my test dataset, to include some tests (e.g. 910 has
> > the same dup as 1019, but for 910 it's not the earliest date):
> >
> >
> > ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
> > ,547,794,814,814,814,814,814,814,841,841,841,841,841
> > ,841,841,841,841,910,910,910,910,910,910,999,1019,1019
> > ,1019)
> >
> > DATE <-
> >  c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
> >  ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
> >  ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
> >  ,20050421,20050421,20060428,20060602,20060816,20061025,20061129
> >  ,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210
> >  ,20091224,20050503,19870508,19870508,19880330)
> >
> > Correct output:
> > "167"  "841"  "1019"
> >
> > Stuart
> >
> > -----Original Message-----
> > From: PIKAL Petr [mailto:petr.pikal at precheza.cz]
> > Sent: 23 October 2012 13:15
> > To: Stuart Leask; r-help at r-project.org
> > Subject: RE: [r] How to pick colums from a ragged array?
> >
> > Hi
> >
> > Rui's answer brought me to a more elaborate solution, which still
> > needs the data frame to be ordered by date
> >
> > fff<-function(data, first=TRUE, remove=FALSE) {
> >
> > testfirst <- function(x) x[1,2]==x[2,2]
> > testlast <- function(x) x[length(x),2]==x[length(x)-1,2]
> >
> > if(first) sel <- as.numeric(names(which(sapply(split(data,
> > data[,1]),
> > testfirst)))) else sel <- as.numeric(names(which(sapply(split(data,
> > data[,1]), testlast))))
> >
> > if (remove) data[data[,1]!=sel,] else data[data[,1]==sel,] }
> >
> >
> > > fff(id.d)
> >     ID     DATE
> > 31 910 20091105
> > 32 910 20091105
> > 33 910 20091117
> > 34 910 20091119
> > 35 910 20091120
> > 36 910 20091210
> > 37 910 20091224
> > 38 910 20091224
> >
> > > fff(id.d, remove=T)
> >      ID     DATE
> > 1    58 20060821
> > 2    58 20061207
> > 3    58 20080102
> > 4    58 20090904
> > 5   167 20040205
> > 6   167 20040323
> > 7   323 20051111
> > 8   323 20060111
> > 9   323 20071119
> > 10  323 20080107
> > 11  323 20080407
> > 12  323 20080521
> > 13  323 20080711
> > 14  547 20041005
> > 15  794 20070905
> > 16  814 20020814
> > 17  814 20021125
> > 18  814 20040429
> > 19  814 20040429
> > 20  814 20071205
> > 21  814 20080227
> > 22  841 20050421
> > 23  841 20060130
> > 24  841 20060428
> > 25  841 20060602
> > 26  841 20060816
> > 27  841 20061025
> > 28  841 20061129
> > 29  841 20070112
> > 30  841 20070514
> > 39  999 20050503
> > 40 1019 19870508
> > 41 1019 19880223
> > 42 1019 19880330
> > 43 1019 19880330
> > >
> >
> > Regards
> > Petr
> >
> >
> > > -----Original Message-----
> > > From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of PIKAL Petr
> > > Sent: Tuesday, October 23, 2012 1:49 PM
> > > To: Stuart Leask; r-help at r-project.org
> > > Subject: Re: [R] [r] How to pick colums from a ragged array?
> > >
> > > Hi
> > >
> > > I did not check your code and rather followed your explanation.
> > > BTW, thanks for test data.
> > >
> > > small change in data frame to make DATE as Date class
> > >
> > > datum<-as.Date(as.character(DATE), format="%Y%m%d")
> > > id.d <- data.frame(ID,datum )
> > >
> > > ordering by date
> > >
> > > id.d<-id.d[order(id.d$datum),]
> > >
> > >
> > > two functions to test if first two dates are the same or last two
> > > dates are the same
> > >
> > > testfirst <- function(x) x[1,2]==x[2,2]
> > > testlast <- function(x) x[length(x),2]==x[length(x)-1,2]
> > >
> > > change one last date in the data frame to be the same as previous
> > >
> > > id.d[35,2]<-id.d[36,2]
> > >
> > > and here are results
> > >
> > > sapply(split(id.d, id.d$ID), testlast)
> > >    58   167   323   547   794   814   841   910   999  1019
> > > FALSE FALSE FALSE    NA    NA FALSE FALSE  TRUE    NA FALSE
> > >
> > > > sapply(split(id.d, id.d$ID), testfirst)
> > >    58   167   323   547   794   814   841   910   999  1019
> > > FALSE FALSE FALSE    NA    NA FALSE FALSE FALSE    NA FALSE
> > >
> > > Now you can select ID which is true and remove it from your data
> > >
> > > which(sapply(split(id.d, id.d$ID), testlast))
> > >
> > > and use it for your data frame to subset/remove
> > >
> > > id.d$ID == as.numeric(names(which(sapply(split(id.d, id.d$ID), testlast))))
> > >  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> > > [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> > > [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
> > > [37]  TRUE  TRUE  TRUE  TRUE
> > >
> > > However I am not sure if this is exactly what you want.
> > >
> > > Regards
> > > Petr
> > >
> > > > -----Original Message-----
> > > > From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Stuart Leask
> > > > Sent: Tuesday, October 23, 2012 11:38 AM
> > > > To: r-help at r-project.org
> > > > Subject: [R] [r] How to pick colums from a ragged array?
> > > >
> > > > I have a large dataset (~1 million rows) of three variables: ID
> > > > (patient's name), DATE (of appointment) and DIAGNOSIS (given on
> > > > that date).
> > > > Patients may have been assigned more than one diagnosis at any
> > > > one appointment - leading to two rows, same ID and DATE but
> > > > different DIAGNOSIS.
> > > > The diagnoses may change between appointments.
> > > >
> > > > I want to subset the data in two ways:
> > > >
> > > > -          define groups of patients by the first diagnosis given
> > > >
> > > > -          define groups of patients by the last diagnosis given.
> > > >
> > > > The problem:
> > > > Unfortunately, a small number of patients have been given more
> > > > than one diagnosis at their first (or last) appointment. These
> > > > individuals I need to identify and remove, as it's not possible
> > > > to say uniquely what their first (or last) diagnosis was. So I
> > > > need to identify and remove these individuals which have pairs of
> > > > rows with the same ID and (lowest or highest) DATE. The size of
> > > > the dataset precludes the option of doing this by eye.
> > > >
> > > > I suspect there is a very elegant way of doing this in R.
> > > >
> > > > This is what I've come up with:
> > > >
> > > >
> > > > -          Sort by DATE then ID
> > > >
> > > > -          Make a ragged array of DATE by ID
> > > >
> > > > -          Remove IDs that only occur once.
> > > >
> > > > -          Subtract the first and second DATEs. Remove IDs for which
> > > > this = zero, as this will only be true for IDs for which the
> > > > appointment is recorded twice (because there were two diagnoses
> > > > recorded on this date).
> > > >
> > > > -          (Then do the same to get the 'last appointment' duplicates,
> > > > by reversing the initial sort by DATE.)
> > > >
> > > > I am stuck at the 'Subtract dates' step: I would like to get the
> > > > data out of the ragged array by columns (so e.g. I end up with a
> > > > matrix of ID, 1st DATE, 2nd DATE). But I can't get the dates out
> > > > by column from the ragged array.
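
A minimal sketch of one way the first two dates per ID could be pulled out of the ragged array as columns, assuming the dates are already sorted within each ID. The toy values and the name first.two are illustrative only and do not come from the thread.

```r
ID   <- c(58, 58, 58, 167, 167, 547)
DATE <- c(20060821, 20061207, 20080102, 20040205, 20040205, 20041005)

rag.a <- split(DATE, ID)  # ragged array: one vector of DATEs per ID

# x[1:2] pads with NA when an ID has only one date, so every ID
# contributes exactly two values and sapply() returns a 2-row matrix
first.two <- t(sapply(rag.a, function(x) x[1:2]))
colnames(first.two) <- c("DATE1", "DATE2")
first.two

# IDs whose first two dates coincide (which() drops the NA comparisons)
names(which(first.two[, "DATE1"] == first.two[, "DATE2"]))
```

Because single-date IDs come out with NA in the second column, the separate step of removing IDs that occur only once is not needed in this sketch. Using rev(x)[1:2] inside the function would give the last two dates instead.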
> > > >
> > > > I hope someone can help. My ugly code is below, with some data
> > > > for testing.
> > > >
> > > >
> > > > Stuart
> > > >
> > > >
> > > > Dr Stuart John Leask DM FRCPsych MB BChir MA
> > > > Clinical Senior Lecturer and Honorary Consultant Psychiatrist
> > > > Institute of Mental Health, Innovation Park
> > > > Triumph Road, Nottingham, Notts. NG7 2TU. UK
> > > > Tel. +44 115 82 30419
> > > > stuart.leask at nottingham.ac.uk
> > > > Google 'Dr Stuart Leask'
> > > >
> > > >
> > > > ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
> > > > ,547,794,814,814,814,814,814,814,841,841,841,841,841
> > > > ,841,841,841,841,910,910,910,910,910,910,999,1019,1019
> > > > ,1019)
> > > >
> > > > DATE <-
> > > > c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
> > > > ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
> > > > ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
> > > > ,20050421,20060130,20060428,20060602,20060816,20061025,20061129
> > > > ,20070112,20070514,20091105,20091117,20091119,20091120,20091210
> > > > ,20091224,20050503,19870508,19880223,19880330)
> > > >
> > > > id.d <- cbind (ID,DATE )
> > > > rag.a  <-  split ( id.d [ ,2 ], id.d [ ,1])       # create ragged array, 1-n DATES for every NAME
> > > >
> > > > # Inelegant attempt to remove IDs that only have one entry:
> > > >
> > > > rag.s <-tapply  (id.d [ ,2], id.d [ ,1], sum)     # add up the dates per row
> > > > # Since DATE is in 'year mo da', if there's only one date, sum will be
> > > > # less than 21000000:
> > > > rag.t <- rag.s [ rag.s > 21000000 ]
> > > > multi.dates <- rownames ( rag.t )                 # all the IDs with >1 date
> > > > rag.am <- rag.a [ multi.dates ]                   # rag.am only has IDs with > 1 Date
> > > >
> > > >
> > > > # But now I'm stuck.
> > > > # Each row of the array is rag.am$ID.
> > > > # So I can't pick columns of DATEs from the ragged array.
> > > >
> > > > ______________________________________________
> > > > R-help at r-project.org mailing list
> > > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > > > and provide commented, minimal, self-contained, reproducible code.