[R] [r] How to pick colums from a ragged array?

Thu Oct 25 12:53:34 CEST 2012

Sorry, forgot to cc to rhelp

Petr


> -----Original Message-----
> From: PIKAL Petr
> Sent: Thursday, October 25, 2012 11:19 AM
> To: 'Stuart Leask'; arun (smartpink111 at yahoo.com)
> Subject: RE: [r] How to pick colums from a ragged array?
> 
> Hi
> 
> If I understand correctly, you now only want to identify rows for which,
> for a given ID, two or more first or last DATEs are the same but DG is
> different, and to put TRUE/FALSE into a new column:
> 
> fff <- function(data) {
>
>   data$Identify <- FALSE
>
>   # TRUE when the first two rows share a DATE but differ in DG
>   testfirst <- function(x) {
>     (x[1, "DATE"] == x[2, "DATE"]) & (x[1, "DG"] != x[2, "DG"])
>   }
>
>   # TRUE when the last two rows share a DATE but differ in DG
>   testlast <- function(x) {
>     (x[nrow(x), "DATE"] == x[nrow(x) - 1, "DATE"]) &
>       (x[nrow(x), "DG"] != x[nrow(x) - 1, "DG"])
>   }
>
>   # IDs flagged by either test (NA and empty results are dropped)
>   sel <- as.numeric(names(which(unlist(sapply(split(data, data[, 1]),
>     testfirst)))))
>
>   sel <- c(sel, as.numeric(names(which(unlist(sapply(split(data,
>     data[, 1]), testlast))))))
>
>   data[data[, 1] %in% sel, "Identify"] <- TRUE
>   data
> }
> 
> I slightly modified my code to get rid of the user selection of the
> first or last variant and handle both together, to add a new column, and
> to extend the testing functions so that they evaluate DG and check
> whether the values are the same or different.
> 
> Does it suit your purpose?
> 
> Regards
> Petr
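
A minimal sketch of the same idea, offered as an alternative and not as the code above: instead of comparing only the first two (or last two) rows, it looks at every row on an ID's earliest and latest DATE and flags the ID when those rows carry more than one distinct DG. The name flag_ambiguous is hypothetical, the data are the test set from Stuart's message quoted below, and this version does not assume the rows are sorted by date.

```r
# Test data copied from Stuart's message
ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323,
        547,794,814,814,814,814,814,814,841,841,841,841,841,
        841,841,841,841,910,910,910,910,910,910,999,1019,1019,
        1019)
DATE <- c(20060821,20061207,20080102,20090904,20040205,20040205,20051111,
          20060111,20071119,20080107,20080407,20080521,20080521,20041005,
          20070905,20020814,20021125,20040429,20040429,20071205,20071205,
          20050421,20050421,20060428,20060602,20060816,20061025,20061129,
          20070112,20070514,19870508,20040205,20040205,20080521,20080521,
          20091224,20050503,19870508,19870508,19880330)
DG <- c(1,2,1,1,4,4,3,2,3,2,1,2,3,2,1,2,2,2,2,2,2,1,2,1,1,1,1,1,1,4,
        3,3,3,4,3,2,2,2,1,1)
id.d <- data.frame(ID, DATE, DG)

# Hypothetical helper (not from the thread): flag every row of an ID whose
# earliest or latest DATE carries more than one distinct DG.
flag_ambiguous <- function(data) {
  ambiguous <- function(x) {
    first_dg <- unique(x$DG[x$DATE == min(x$DATE)])
    last_dg  <- unique(x$DG[x$DATE == max(x$DATE)])
    length(first_dg) > 1 || length(last_dg) > 1
  }
  # one TRUE/FALSE per ID
  flags <- vapply(split(data, data$ID), ambiguous, logical(1))
  data$Identify <- data$ID %in% as.numeric(names(flags)[flags])
  data
}

res <- flag_ambiguous(id.d)
unique(res$ID[res$Identify])
```

On the test data this flags IDs 323, 841 and 1019. IDs 167 and 814 also have a duplicated first or last DATE, but the DG values on those rows are identical, so they are left alone.
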
> 
> 
> 
> > -----Original Message-----
> > From: Stuart Leask [mailto:Stuart.Leask at nottingham.ac.uk]
> > Sent: Wednesday, October 24, 2012 5:25 PM
> > To: arun (smartpink111 at yahoo.com); PIKAL Petr; Rui Barradas
> > (ruipbarradas at sapo.pt)
> > Subject: RE: [r] How to pick colums from a ragged array?
> >
> > Arun, Petr, Rui, many thanks for your help, and the functions you have
> > written.
> >
> > You'll recall I wanted to remove these first (or last) duplicates,
> > because they represented instances where two different diagnoses (in
> > this case, variable DG, value 1, 2, 3, 4 or 5) had been recorded on
> > the same day - so I can't say which was 'first' (or 'last').
> >
> > Your functions have revealed something I wasn't expecting: In some
> > cases, the diagnoses recorded on the duplicated DATEs are the same!
> > This is a surprise to me, but probably reflects someone going to two
> > different departments in a clinic, and both departments submit data. I
> > have to say that crazy things like this are often a feature of real
> > data, which I'm sure you've come across yourselves.
> >
> > Of course, I don't want to remove records in which I can determine an
> > unambiguous 'first diagnosis'.
> >
> > You have all put in so much effort on my behalf, I'm ashamed to ask,
> > but I wonder if any of the functions you've written could do this with
> > a little more indexing and the 'duplicated' function. The function
> > should only exclude an ID if, having identified a first (or last) DATE
> > duplicate, the DGs for these two dates are different.
> >
> > Test dataset:
> >
> > ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
> > ,547,794,814,814,814,814,814,814,841,841,841,841,841
> > ,841,841,841,841,910,910,910,910,910,910,999,1019,1019
> > ,1019)
> >
> > DATE <-
> >  c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
> >  ,20060111,20071119,20080107,20080407,20080521,20080521,20041005
> >  ,20070905,20020814,20021125,20040429,20040429,20071205,20071205
> >  ,20050421,20050421,20060428,20060602,20060816,20061025,20061129
> >  ,20070112,20070514, 19870508,20040205,20040205, 20080521,20080521
> >  ,20091224,20050503,19870508,19870508,19880330)
> >
> > DG <-
> >  c(1,2,1,1,4,4,3,2,3,2,1,2,3,2,1,2,2,2,2,2,2,1,2,1,1,1,1,1,1,4,3,3,3,4,3
> >  ,2,2,2,1,1)
> >
> > id.d<-data.frame(ID,DATE,DG)
> > id.d
> >
> > # Considering Rui's getRepeat function:
> >
> > g.r<-getRepeat(id.d)    # defaults to first = TRUE;
> >                         # getRepeat(id.d, first = FALSE) to get the last ones
> > g.rr<-do.call(rbind, g.r) # put the data into a matrix
> >
> > # I can remove the date duplicates with:
> > g.rr[rep(!duplicated(g.rr)[(1:(dim(g.rr)[1]/2))*2],each=2),]
> >
> > I'm not sure how to add this to your suggestions, Arun & Petr...
> >
> >
> > Stuart
> >
> >
> > -----Original Message-----
> > From: PIKAL Petr [mailto:petr.pikal at precheza.cz]
> > Sent: 23 October 2012 15:24
> > To: Stuart Leask
> > Subject: RE: [r] How to pick colums from a ragged array?
> >
> > Hi
> >
> > I assumed that id.d is a data frame
> >
> > id.d <- data.frame (ID,DATE )
> >
> > and
> >
> > fff(id.d)
> >
> > works for me
> >
> > Petr
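
The error Stuart reported (quoted below) comes from id.d being built with cbind, which gives a matrix: split() then cuts the matrix up as a plain vector, so each piece has no dimensions and x[1,2] fails. A small sketch on made-up values; the object names m, df, pieces_m and pieces_df are illustrative only.

```r
ID   <- c(58, 58, 167, 167)
DATE <- c(20060821, 20061207, 20040205, 20040205)

m  <- cbind(ID, DATE)        # numeric matrix
df <- data.frame(ID, DATE)   # data frame

pieces_m  <- split(m, m[, 1])    # matrix is treated as a plain vector
pieces_df <- split(df, df[, 1])  # each piece is a data frame

is.null(dim(pieces_m[[1]]))      # the matrix pieces have lost their dimensions
is.data.frame(pieces_df[[1]])    # the data frame pieces keep rows and columns

# Indexing a dimensionless piece with [row, column] is what raises the error
err <- tryCatch(pieces_m[[1]][1, 2], error = function(e) conditionMessage(e))
err

# The same indexing works on a data frame piece
pieces_df[["167"]][1, 2] == pieces_df[["167"]][2, 2]
```
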
> >
> >
> > > -----Original Message-----
> > > From: Stuart Leask [mailto:Stuart.Leask at nottingham.ac.uk]
> > > Sent: Tuesday, October 23, 2012 3:13 PM
> > > To: PIKAL Petr
> > > Subject: RE: [r] How to pick colums from a ragged array?
> > >
> > > Hi Petr.
> > > I see what you mean it should do, but when I run it I get an error
> > > (see below).
> > > Stuart
> > >
> > >
> > > > ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
> > > + ,547,794,814,814,814,814,814,814,841,841,841,841,841
> > > + ,841,841,841,841,910,910,910,910,910,910,999,1019,1019
> > > + ,1019)
> > > >
> > > > DATE <-
> > > +  c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
> > > +  ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
> > > +  ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
> > > +  ,20050421,20050421,20060428,20060602,20060816,20061025,20061129
> > > +  ,20070112,20070514, 19870508,20040205,20040205,
> 20091120,20091210
> > > +  ,20091224,20050503,19870508,19870508,19880330)
> > > >
> > > >  id.d <- cbind (ID,DATE )
> > > > fff<-function(data, first=TRUE, remove=FALSE) {
> > > +
> > > > + testfirst <- function(x) x[1,2]==x[2,2]
> > > > + testlast <- function(x) x[nrow(x),2]==x[nrow(x)-1,2]
> > > > +
> > > > + if(first) sel <- as.numeric(names(which(unlist(sapply(split(data,
> > > > + data[,1]), testfirst)))))
> > > > + else sel <- as.numeric(names(which(unlist(sapply(split(data,
> > > > + data[,1]), testlast)))))
> > > > +
> > > > + if (remove) data[!data[,1] %in% sel,] else data[data[,1] %in% sel,]
> > > > + }
> > > >
> > > > fff(id.d)
> > > Error in x[1, 2] : incorrect number of dimensions
> > > >
> > >
> > > -----Original Message-----
> > > From: PIKAL Petr [mailto:petr.pikal at precheza.cz]
> > > Sent: 23 October 2012 13:51
> > > To: Stuart Leask; r-help at r-project.org
> > > Subject: RE: [r] How to pick colums from a ragged array?
> > >
> > > Hi
> > >
> > > > -----Original Message-----
> > > > From: Stuart Leask [mailto:Stuart.Leask at nottingham.ac.uk]
> > > > Sent: Tuesday, October 23, 2012 2:29 PM
> > > > To: PIKAL Petr; r-help at r-project.org
> > > > Subject: RE: [r] How to pick colums from a ragged array?
> > > >
> > > > Hi there.
> > > >
> > > > Not sure I follow what you are doing.
> > > >
> > > > I want a list of all the IDs that have duplicate DATE entries,
> > > > only when the DATE is the earliest (or last) date for that ID.
> > >
> > > And that is what the function (with 3 small modifications) does
> > >
> > >
> > > fff<-function(data, first=TRUE, remove=FALSE) {
> > >
> > > > testfirst <- function(x) x[1,2]==x[2,2]
> > > > testlast <- function(x) x[nrow(x),2]==x[nrow(x)-1,2]
> > > >
> > > > if(first) sel <- as.numeric(names(which(unlist(sapply(split(data,
> > > > data[,1]), testfirst)))))
> > > > else sel <- as.numeric(names(which(unlist(sapply(split(data,
> > > > data[,1]), testlast)))))
> > > >
> > > > if (remove) data[!data[,1] %in% sel,] else data[data[,1] %in% sel,]
> > > > }
> > >
> > > > See the result for your refined data
> > >
> > > fff(id.d)
> > >      ID       DATE
> > > 5   167 2004-02-05
> > > 6   167 2004-02-05
> > > 22  841 2005-04-21
> > > 23  841 2005-04-21
> > > 24  841 2006-04-28
> > > 25  841 2006-06-02
> > > 26  841 2006-08-16
> > > 27  841 2006-10-25
> > > 28  841 2006-11-29
> > > 29  841 2007-01-12
> > > 30  841 2007-05-14
> > > 38 1019 1987-05-08
> > > 39 1019 1987-05-08
> > > 40 1019 1988-03-30
> > > > fff(id.d, first=F)
> > >    ID       DATE
> > > 5 167 2004-02-05
> > > 6 167 2004-02-05
> > > > fff(id.d, remove=T)
> > >     ID       DATE
> > > 1   58 2006-08-21
> > > 2   58 2006-12-07
> > > 3   58 2008-01-02
> > > 4   58 2009-09-04
> > > 7  323 2005-11-11
> > > 8  323 2006-01-11
> > > 9  323 2007-11-19
> > > 10 323 2008-01-07
> > > 11 323 2008-04-07
> > > 12 323 2008-05-21
> > > 13 323 2008-07-11
> > > 14 547 2004-10-05
> > > 15 794 2007-09-05
> > > 16 814 2002-08-14
> > > 17 814 2002-11-25
> > > 18 814 2004-04-29
> > > 19 814 2004-04-29
> > > 20 814 2007-12-05
> > > 21 814 2008-02-27
> > > 31 910 1987-05-08
> > > 32 910 2004-02-05
> > > 33 910 2004-02-05
> > > 34 910 2009-11-20
> > > 35 910 2009-12-10
> > > 36 910 2009-12-24
> > > 37 999 2005-05-03
> > > >
> > >
> > > > You can take the fff function apart to see what result comes from
> > > > each piece of it, e.g.
> > >
> > > sapply(split(id.d, id.d[,1]), testlast)
> > >
> > > Regards
> > > Petr
> > >
> > > >
> > > > > I have refined my test dataset, to include some tests (e.g. 910
> > > > > has the same dup as 1019, but for 910 it's not the earliest date):
> > > >
> > > >
> > > > ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
> > > > ,547,794,814,814,814,814,814,814,841,841,841,841,841
> > > > ,841,841,841,841,910,910,910,910,910,910,999,1019,1019
> > > > ,1019)
> > > >
> > > > DATE <-
> > > >  c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
> > > >  ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
> > > >  ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
> > > >  ,20050421,20050421,20060428,20060602,20060816,20061025,20061129
> > > >  ,20070112,20070514, 19870508,20040205,20040205,
> 20091120,20091210
> > > >  ,20091224,20050503,19870508,19870508,19880330)
> > > >
> > > > Correct output:
> > > > "167"  "841"  "1019"
> > > >
> > > > Stuart
> > > >
> > > > -----Original Message-----
> > > > From: PIKAL Petr [mailto:petr.pikal at precheza.cz]
> > > > Sent: 23 October 2012 13:15
> > > > To: Stuart Leask; r-help at r-project.org
> > > > Subject: RE: [r] How to pick colums from a ragged array?
> > > >
> > > > Hi
> > > >
> > > > > Rui's answer led me to a more elaborate solution, which still
> > > > > needs the data frame to be ordered by date
> > > >
> > > > fff<-function(data, first=TRUE, remove=FALSE) {
> > > >
> > > > > testfirst <- function(x) x[1,2]==x[2,2]
> > > > > testlast <- function(x) x[length(x),2]==x[length(x)-1,2]
> > > > >
> > > > > if(first) sel <- as.numeric(names(which(sapply(split(data,
> > > > > data[,1]), testfirst))))
> > > > > else sel <- as.numeric(names(which(sapply(split(data,
> > > > > data[,1]), testlast))))
> > > > >
> > > > > if (remove) data[data[,1]!=sel,] else data[data[,1]==sel,]
> > > > > }
> > > >
> > > >
> > > > > fff(id.d)
> > > >     ID     DATE
> > > > 31 910 20091105
> > > > 32 910 20091105
> > > > 33 910 20091117
> > > > 34 910 20091119
> > > > 35 910 20091120
> > > > 36 910 20091210
> > > > 37 910 20091224
> > > > 38 910 20091224
> > > >
> > > > > fff(id.d, remove=T)
> > > >      ID     DATE
> > > > 1    58 20060821
> > > > 2    58 20061207
> > > > 3    58 20080102
> > > > 4    58 20090904
> > > > 5   167 20040205
> > > > 6   167 20040323
> > > > 7   323 20051111
> > > > 8   323 20060111
> > > > 9   323 20071119
> > > > 10  323 20080107
> > > > 11  323 20080407
> > > > 12  323 20080521
> > > > 13  323 20080711
> > > > 14  547 20041005
> > > > 15  794 20070905
> > > > 16  814 20020814
> > > > 17  814 20021125
> > > > 18  814 20040429
> > > > 19  814 20040429
> > > > 20  814 20071205
> > > > 21  814 20080227
> > > > 22  841 20050421
> > > > 23  841 20060130
> > > > 24  841 20060428
> > > > 25  841 20060602
> > > > 26  841 20060816
> > > > 27  841 20061025
> > > > 28  841 20061129
> > > > 29  841 20070112
> > > > 30  841 20070514
> > > > 39  999 20050503
> > > > 40 1019 19870508
> > > > 41 1019 19880223
> > > > 42 1019 19880330
> > > > 43 1019 19880330
> > > > >
> > > >
> > > > Regards
> > > > Petr
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > > From: r-help-bounces at r-project.org
> > > > > >   [mailto:r-help-bounces at r-project.org] On Behalf Of PIKAL Petr
> > > > > Sent: Tuesday, October 23, 2012 1:49 PM
> > > > > To: Stuart Leask; r-help at r-project.org
> > > > > Subject: Re: [R] [r] How to pick colums from a ragged array?
> > > > >
> > > > > Hi
> > > > >
> > > > > > I did not check your code and rather followed your explanation.
> > > > > > BTW, thanks for the test data.
> > > > > >
> > > > > > A small change to the data frame to store DATE as Date class
> > > > > >
> > > > > > datum<-as.Date(as.character(DATE), format="%Y%m%d")
> > > > > > id.d <- data.frame(ID,datum )
> > > > >
> > > > > ordering by date
> > > > >
> > > > > id.d<-id.d[order(id.d$datum),]
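
A small sketch of the conversion and ordering step on a few made-up rows. Ordering by ID as well as by date is an addition here and not part of the message; split() keeps the within-ID row order either way.

```r
ID   <- c(167, 58, 167, 58)
DATE <- c(20040205, 20061207, 20040205, 20060821)

# yyyymmdd numbers -> character -> Date
datum <- as.Date(as.character(DATE), format = "%Y%m%d")
id.d  <- data.frame(ID, datum)

# order by ID, then by date within ID
id.d <- id.d[order(id.d$ID, id.d$datum), ]
id.d
```
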
> > > > >
> > > > >
> > > > > two functions to test if first two dates are the same or last
> > > > > two dates are the same
> > > > >
> > > > > > testfirst <- function(x) x[1,2]==x[2,2]
> > > > > > testlast <- function(x) x[length(x),2]==x[length(x)-1,2]
> > > > >
> > > > > change one last date in the data frame to be the same as
> > > > > previous
> > > > >
> > > > > id.d[35,2]<-id.d[36,2]
> > > > >
> > > > > and here are results
> > > > >
> > > > > sapply(split(id.d, id.d$ID), testlast)
> > > > >    58   167   323   547   794   814   841   910   999  1019
> > > > > FALSE FALSE FALSE    NA    NA FALSE FALSE  TRUE    NA FALSE
> > > > >
> > > > > > sapply(split(id.d, id.d$ID), testfirst)
> > > > >    58   167   323   547   794   814   841   910   999  1019
> > > > > FALSE FALSE FALSE    NA    NA FALSE FALSE FALSE    NA FALSE
> > > > >
> > > > > > Now you can select the ID for which the test is TRUE and remove
> > > > > > it from your data
> > > > > >
> > > > > > which(sapply(split(id.d, id.d$ID), testlast))
> > > > > >
> > > > > > and use it for your data frame to subset/remove
> > > > > >
> > > > > > id.d$ID == as.numeric(names(which(sapply(split(id.d, id.d$ID), testlast))))
> > > > > >  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> > > > > > [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
> > > > > > [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
> > > > > > [37]  TRUE  TRUE  TRUE  TRUE
> > > > >
> > > > > However I am not sure if this is exactly what you want.
> > > > >
> > > > > Regards
> > > > > Petr
> > > > >
> > > > > > -----Original Message-----
> > > > > > > From: r-help-bounces at r-project.org
> > > > > > >   [mailto:r-help-bounces at r-project.org] On Behalf Of Stuart Leask
> > > > > > Sent: Tuesday, October 23, 2012 11:38 AM
> > > > > > To: r-help at r-project.org
> > > > > > Subject: [R] [r] How to pick colums from a ragged array?
> > > > > >
> > > > > > > I have a large dataset (~1 million rows) of three variables:
> > > > > > > ID (patient's name), DATE (of appointment) and DIAGNOSIS
> > > > > > > (given on that date).
> > > > > > > Patients may have been assigned more than one diagnosis at any
> > > > > > > one appointment - leading to two rows, same ID and DATE but
> > > > > > > different DIAGNOSIS.
> > > > > > > The diagnoses may change between appointments.
> > > > > >
> > > > > > I want to subset the data in two ways:
> > > > > >
> > > > > > > -          define groups of patients by the first diagnosis given
> > > > > > >
> > > > > > > -          define groups of patients by the last diagnosis given.
> > > > > >
> > > > > > The problem:
> > > > > > > Unfortunately, a small number of patients have been given more
> > > > > > > than one diagnosis at their first (or last) appointment. These
> > > > > > > individuals I need to identify and remove, as it's not possible
> > > > > > > to say uniquely what their first (or last) diagnosis was. So I
> > > > > > > need to identify and remove these individuals which have pairs
> > > > > > > of rows with the same ID and (lowest or highest) DATE. The size
> > > > > > > of the dataset precludes the option of doing this by eye.
> > > > > >
> > > > > > I suspect there is a very elegant way of doing this in R.
> > > > > >
> > > > > > This is what I've come up with:
> > > > > >
> > > > > >
> > > > > > -          Sort by DATE then ID
> > > > > >
> > > > > > -          Make a ragged array of DATE by ID
> > > > > >
> > > > > > -          Remove IDs that only occur once.
> > > > > >
> > > > > > > -          Subtract the first and second DATEs. Remove IDs for
> > > > > > > which this = zero, as this will only be true for IDs for which
> > > > > > > the appointment is recorded twice (because there were two
> > > > > > > diagnoses recorded on this date).
> > > > > > >
> > > > > > > -          (Then do the same to get the 'last appointment'
> > > > > > > duplicates, by reversing the initial sort by DATE.)
> > > > > >
> > > > > > > I am stuck at the 'Subtract dates' step: I would like to get the
> > > > > > > data out of the ragged array by columns (so e.g. I end up with a
> > > > > > > matrix of ID, 1st DATE, 2nd DATE). But I can't get the dates out
> > > > > > > by column from the ragged array.
> > > > > > >
> > > > > > > I hope someone can help. My ugly code is below, with some data
> > > > > > > for testing.
> > > > > >
> > > > > >
> > > > > > Stuart
> > > > > >
> > > > > >
> > > > > > > Dr Stuart John Leask DM FRCPsych MB BChir MA
> > > > > > > Clinical Senior Lecturer and Honorary Consultant Psychiatrist
> > > > > > > Institute of Mental Health, Innovation Park
> > > > > > > Triumph Road, Nottingham, Notts. NG7 2TU. UK
> > > > > > > Tel. +44 115 82 30419
> > > > > > > stuart.leask at nottingham.ac.uk<mailto:stuart.leask at nottingham.ac.uk>
> > > > > > > Google 'Dr Stuart Leask'
> > > > > >
> > > > > >
> > > > > > ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
> > > > > > ,547,794,814,814,814,814,814,814,841,841,841,841,841
> > > > > > ,841,841,841,841,910,910,910,910,910,910,999,1019,1019
> > > > > > ,1019)
> > > > > >
> > > > > > > DATE <-
> > > > > > >  c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
> > > > > > >  ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
> > > > > > >  ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
> > > > > > >  ,20050421,20060130,20060428,20060602,20060816,20061025,20061129
> > > > > > >  ,20070112,20070514,20091105,20091117,20091119,20091120,20091210
> > > > > > >  ,20091224,20050503,19870508,19880223,19880330)
> > > > > >
> > > > > > > id.d <- cbind (ID,DATE )
> > > > > > > # create ragged array, 1-n DATES for every NAME
> > > > > > > rag.a  <-  split ( id.d [ ,2 ], id.d [ ,1])
> > > > > > >
> > > > > > > # Inelegant attempt to remove IDs that only have one entry:
> > > > > > >
> > > > > > > # add up the dates per row
> > > > > > > rag.s <-tapply  (id.d [ ,2], id.d [ ,1], sum)
> > > > > > > # Since DATE is in 'year mo da', if there's only one date, sum
> > > > > > > # will be less than 21000000:
> > > > > > > rag.t <- rag.s [ rag.s > 21000000 ]
> > > > > > > # all the IDs with >1 date
> > > > > > > multi.dates <- rownames ( rag.t )
> > > > > > > # rag.am only has IDs with > 1 Date
> > > > > > > rag.am <- rag.a [ multi.dates ]
> > > > > >
> > > > > >
> > > > > > # But now I'm stuck.
> > > > > > # Each row of the array is rag.am$ID.
> > > > > > # So I can't pick columns of DATEs from the ragged array.
> > > > > >
> > > > > >
> > > > > > ______________________________________________
> > > > > > R-help at r-project.org mailing list
> > > > > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > > > > > PLEASE do read the posting guide
> > > > > > > http://www.R-project.org/posting-guide.html
> > > > > > > and provide commented, minimal, self-contained, reproducible code.
> > > > >
> > > > > ______________________________________________
> > > > > R-help at r-project.org mailing list
> > > > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > > > PLEASE do read the posting guide http://www.R-
> > project.org/posting-
> > > > > guide.html and provide commented, minimal, self-contained,
> > > > > reproducible code.