[R] [r] How to pick columns from a ragged array?

Wed Oct 24 20:05:40 CEST 2012

Hi Rui,

I think our results now match, except in the INCLUDE column:

id.d[c(11:13,22:24,38:40),]
#     ID     DATE DG INCLUDE
#11  323 20080407  1    TRUE
#12  323 20080521  2   FALSE
#13  323 20080521  3    TRUE
#22  841 20050421  1    TRUE
#23  841 20050421  2   FALSE
#24  841 20060428  1    TRUE
#38 1019 19870508  2    TRUE
#39 1019 19870508  1   FALSE
#40 1019 19880330  1    TRUE

I thought all the rows with the above IDs would be FALSE (from my solution):

res4[c(11:13,22:24,38:40),]
#     ID     DATE DG INCLUDE
#11  323 20080407  1   FALSE
#12  323 20080521  2   FALSE
#13  323 20080521  3   FALSE
#22  841 20050421  1   FALSE
#23  841 20050421  2   FALSE
#24  841 20060428  1   FALSE
#38 1019 19870508  1   FALSE
#39 1019 19870508  2   FALSE
#40 1019 19880330  1   FALSE
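
One possible way to get every row of such an ID marked FALSE is to decide at the ID level first and then spread that decision over the rows. This is only a minimal sketch, assuming the test data Stuart posted (with DG) and rows already sorted by DATE within each ID. ambiguousEnd() is a name made up for this illustration and is not one of the functions posted earlier in the thread.

```r
# Test data as posted by Stuart in this thread
ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323,
        547,794,814,814,814,814,814,814,841,841,841,841,841,
        841,841,841,841,910,910,910,910,910,910,999,1019,1019,
        1019)
DATE <- c(20060821,20061207,20080102,20090904,20040205,20040205,20051111,
          20060111,20071119,20080107,20080407,20080521,20080521,20041005,
          20070905,20020814,20021125,20040429,20040429,20071205,20071205,
          20050421,20050421,20060428,20060602,20060816,20061025,20061129,
          20070112,20070514,19870508,20040205,20040205,20080521,20080521,
          20091224,20050503,19870508,19870508,19880330)
DG <- c(1,2,1,1,4,4,3,2,3,2,1,2,3,2,1,2,2,2,2,2,2,1,2,1,1,1,1,1,1,4,
        3,3,3,4,3,2,2,2,1,1)
id.d <- data.frame(ID, DATE, DG)

# Hypothetical helper: TRUE if the first (or last) two rows of one ID
# share a DATE but carry different DG values.
# Assumes the rows of d are sorted by DATE.
ambiguousEnd <- function(d, first = TRUE) {
  n <- nrow(d)
  if (n < 2) return(FALSE)            # a single row cannot be ambiguous
  i <- if (first) 1:2 else (n - 1):n  # the two rows to compare
  d$DATE[i[1]] == d$DATE[i[2]] && d$DG[i[1]] != d$DG[i[2]]
}

# One TRUE/FALSE per ID, then expand to one value per row
groups <- split(id.d, id.d$ID)
bad <- sapply(groups, function(d)
  ambiguousEnd(d) || ambiguousEnd(d, first = FALSE))
badIDs <- as.numeric(names(bad)[bad])
id.d$INCLUDE <- !(id.d$ID %in% badIDs)

id.d[c(11:13,22:24,38:40),]
```

With this approach the INCLUDE flag is the same for all rows of an ID, so IDs 323, 841 and 1019 are excluded as a whole, while IDs whose duplicated first or last DATE carries the same DG (167 and 814 here) are kept.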

A.K.

----- Original Message -----
From: Rui Barradas <ruipbarradas at sapo.pt>
To: Stuart Leask <Stuart.Leask at nottingham.ac.uk>
Cc: "arun (smartpink111 at yahoo.com)" <smartpink111 at yahoo.com>; PIKAL Petr <petr.pikal at precheza.cz>; r-help <r-help at r-project.org>
Sent: Wednesday, October 24, 2012 1:41 PM
Subject: Re: [r] How to pick colums from a ragged array?

Hello,

Using one of Arun's ideas from a few posts ago, I think this new function
returns a logical index into id.d of the rows that should be _removed_,
hence the names rm1 and rm2.

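getRepLogical <- function(x, first = TRUE){
     # work on the first two rows of each ID (head) or the last two (tail)
     fun <- if(first) head else tail
     # per ID: TRUE where the second of those two DATEs repeats the first
     dte <- tapply(x[,2], x[,1], FUN = function(x) duplicated(fun(x, 2)))
     len <- tapply(x[,2], x[,1], FUN = length)
     # pad with FALSE so that each ID gets one value per row
     lst <- lapply(seq_along(dte), function(i) c(dte[[i]], rep(FALSE,
         if(len[[i]] > 2) len[[i]] - 2 else 0)))
     # for the last rows, reverse so that the padding comes first
     lst <- if(first) lst else lapply(lst, rev)
     i1 <- unlist(lst)
     # per ID: TRUE where the DG in those two rows is not a repeat
     dg <- tapply(x[,3], x[,1], FUN = function(x) !duplicated(fun(x, 2)))
     lst <- lapply(seq_along(dte), function(i) c(dg[[i]], rep(FALSE,
         if(len[[i]] > 2) len[[i]] - 2 else 0)))
     lst <- if(first) lst else lapply(lst, rev)
     i2 <- unlist(lst)
     # flag a row only if the DATE repeats and the DG differs
     i1 & i2
}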

rm1 <- getRepLogical(id.d)
rm2 <- getRepLogical(id.d, first = FALSE)

id.d[rm1, ]
id.d[rm2, ]

id.d$INCLUDE <- !(rm1 | rm2)
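
To see what the index looks like for a single ID, here is a small sketch of the same steps on three rows only (the DATE and DG values of the first three rows of ID 841 in the test data). The variable names are illustrative and not part of the function above.

```r
dates <- c(20050421, 20050421, 20060428)   # first two dates repeat
dg    <- c(1, 2, 1)                        # ... with different diagnoses

dupDate <- duplicated(head(dates, 2))      # FALSE  TRUE
newDG   <- !duplicated(head(dg, 2))        # TRUE   TRUE
pad     <- rep(FALSE, length(dates) - 2)   # one FALSE for the third row

# TRUE only where the DATE repeats and the DG differs
flag <- c(dupDate, pad) & c(newDG, pad)
flag
# [1] FALSE  TRUE FALSE
```

So within an ID the index is TRUE for one row of the offending pair only, not for every row of that ID.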

Hope this helps,

Rui Barradas
On 24-10-2012 16:41, Stuart Leask wrote:
> (And, considering  the real application, the functions ideally should probably output a variable INCLUDE, the same length as the original data, with TRUE and FALSE for whether or not that row should be included...)
>
> -----Original Message-----
> From: Leask Stuart
> Sent: 24 October 2012 16:25
> To: arun (smartpink111 at yahoo.com); 'PIKAL Petr'; Rui Barradas (ruipbarradas at sapo.pt)
> Subject: RE: [r] How to pick colums from a ragged array?
>
> Arun, Petr, Rui, many thanks for your help, and the functions you have written.
>
> You'll recall I wanted to remove these first (or last) duplicates, because they represented instances where two different diagnoses (in this case, variable DG, value 1, 2, 3, 4 or 5) had been recorded on the same day - so I can't say which was 'first' (or 'last').
>
> Your functions have revealed something I wasn't expecting: In some cases, the diagnoses recorded on the duplicated DATEs are the same!
> This is a surprise to me, but probably reflects someone going to two different departments in a clinic, and both departments submit data. I have to say that crazy things like this are often a feature of real data, which I'm sure you've come across yourselves.
>
> Of course, I don't want to remove records in which I can determine an unambiguous 'first diagnosis'.
>
> You have all put in so much effort on my behalf, I'm ashamed to ask, but I wonder if any of the functions you've written could do this with a little more indexing and the 'duplicated' function.
> So the function should only exclude an ID if, having identified a first (or last) DATE duplicate, the DGs for these two dates are different.
>
> Test dataset:
>
> ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
> ,547,794,814,814,814,814,814,814,841,841,841,841,841
> ,841,841,841,841,910,910,910,910,910,910,999,1019,1019
> ,1019)
>
> DATE <-
>   c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
>   ,20060111,20071119,20080107,20080407,20080521,20080521,20041005
>   ,20070905,20020814,20021125,20040429,20040429,20071205,20071205
>   ,20050421,20050421,20060428,20060602,20060816,20061025,20061129
>   ,20070112,20070514, 19870508,20040205,20040205, 20080521,20080521
>   ,20091224,20050503,19870508,19870508,19880330)
>
> DG<-
> c(1,2,1,1,4,4,3,2,3,2,1,2,3,2,1,2,2,2,2,2,2,1,2,1,1,1,1,1,1,4,3,3,3,4,3,2,2,2,1,1)
>
> id.d<-data.frame(ID,DATE,DG)
> id.d
>
> # Considering Rui's getRepeat function:
>
> g.r<-getRepeat(id.d)    # defaults to first = TRUE; use getRepeat(id.d, first = FALSE) to get the last ones
> g.rr<-do.call(rbind, g.r) # put the data into a matrix
>
> # I can remove the date duplicates with:
> g.rr[rep(!duplicated(g.rr)[(1:(dim(g.rr)[1]/2))*2],each=2),]
>
> I'm not sure how to add this to your suggestions, Arun & Petr...
>
>
> Stuart
>
>
> -----Original Message-----
> From: PIKAL Petr [mailto:petr.pikal at precheza.cz]
> Sent: 23 October 2012 15:24
> To: Stuart Leask
> Subject: RE: [r] How to pick colums from a ragged array?
>
> Hi
>
> I assumed that id.d is data frame
>
> id.d <- data.frame (ID,DATE )
>
> and
>
> fff(id.d)
>
> works for me
>
> Petr
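
The "incorrect number of dimensions" error quoted further down appears to come from the type of id.d: cbind(ID, DATE) builds a matrix, and split() treats a matrix as a plain vector, so each piece has no rows or columns to index. A minimal sketch of the difference, using a shortened, made-up subset of the data rather than the full test set:

```r
# Shortened, illustrative data (not the full test set from the thread)
ID   <- c(58, 58, 167, 167)
DATE <- c(20060821, 20061207, 20040205, 20040205)

m <- cbind(ID, DATE)          # a numeric matrix
d <- data.frame(ID, DATE)     # a data frame

pieces.m <- split(m, m[, 1])  # matrix is treated as a plain vector
pieces.d <- split(d, d[, 1])  # one data frame per ID

is.null(dim(pieces.m[[1]]))   # TRUE: no dimensions, so x[1, 2] fails
is.data.frame(pieces.d[[1]])  # TRUE: x[1, 2] is the first DATE of that ID
```

Building id.d with data.frame() instead of cbind(), as above, gives split() pieces that can be indexed by row and column.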
>
>
>> -----Original Message-----
>> From: Stuart Leask [mailto:Stuart.Leask at nottingham.ac.uk]
>> Sent: Tuesday, October 23, 2012 3:13 PM
>> To: PIKAL Petr
>> Subject: RE: [r] How to pick colums from a ragged array?
>>
>> Hi Petr.
>> I see what you mean it should do, but when I run it I get an error
>> (see below).
>> Stuart
>>
>>
>>> ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
>> + ,547,794,814,814,814,814,814,814,841,841,841,841,841
>> + ,841,841,841,841,910,910,910,910,910,910,999,1019,1019
>> + ,1019)
>>> DATE <-
>> +  c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
>> +  ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
>> +  ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
>> +  ,20050421,20050421,20060428,20060602,20060816,20061025,20061129
>> +  ,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210
>> +  ,20091224,20050503,19870508,19870508,19880330)
>>>   id.d <- cbind (ID,DATE )
>>> fff<-function(data, first=TRUE, remove=FALSE) {
>> +
>> + testfirst <- function(x) x[1,2]==x[2,2]
>> + testlast <- function(x) x[nrow(x),2]==x[nrow(x)-1,2]
>> +
>> + if(first) sel <- as.numeric(names(which(unlist(sapply(split(data,
>> + data[,1]), testfirst))))) else sel <-
>> + as.numeric(names(which(unlist(sapply(split(data, data[,1]),
>> + testlast)))))
>> +
>> + if (remove) data[!data[,1] %in% sel,] else data[data[,1] %in% sel,]
>> + }
>>> fff(id.d)
>> Error in x[1, 2] : incorrect number of dimensions
>> -----Original Message-----
>> From: PIKAL Petr [mailto:petr.pikal at precheza.cz]
>> Sent: 23 October 2012 13:51
>> To: Stuart Leask; r-help at r-project.org
>> Subject: RE: [r] How to pick colums from a ragged array?
>>
>> Hi
>>
>>> -----Original Message-----
>>> From: Stuart Leask [mailto:Stuart.Leask at nottingham.ac.uk]
>>> Sent: Tuesday, October 23, 2012 2:29 PM
>>> To: PIKAL Petr; r-help at r-project.org
>>> Subject: RE: [r] How to pick colums from a ragged array?
>>>
>>> Hi there.
>>>
>>> Not sure I follow what you are doing.
>>>
>>> I want a list of all the IDs that have duplicate DATE entries, only
>>> when the DATE is the earliest (or last) date for that ID.
>> And that is what the function (with 3 small modifications) does
>>
>>
>> fff<-function(data, first=TRUE, remove=FALSE) {
>>
>> testfirst <- function(x) x[1,2]==x[2,2]
>> testlast <- function(x) x[nrow(x),2]==x[nrow(x)-1,2]
>>
>> if(first) sel <- as.numeric(names(which(unlist(sapply(split(data,
>> data[,1]), testfirst))))) else sel <-
>> as.numeric(names(which(unlist(sapply(split(data, data[,1]),
>> testlast)))))
>>
>> if (remove) data[!data[,1] %in% sel,] else data[data[,1] %in% sel,] }
>>
>> See the result of your refined data
>>
>> fff(id.d)
>>       ID       DATE
>> 5   167 2004-02-05
>> 6   167 2004-02-05
>> 22  841 2005-04-21
>> 23  841 2005-04-21
>> 24  841 2006-04-28
>> 25  841 2006-06-02
>> 26  841 2006-08-16
>> 27  841 2006-10-25
>> 28  841 2006-11-29
>> 29  841 2007-01-12
>> 30  841 2007-05-14
>> 38 1019 1987-05-08
>> 39 1019 1987-05-08
>> 40 1019 1988-03-30
>>> fff(id.d, first=F)
>>     ID       DATE
>> 5 167 2004-02-05
>> 6 167 2004-02-05
>>> fff(id.d, remove=T)
>>      ID       DATE
>> 1   58 2006-08-21
>> 2   58 2006-12-07
>> 3   58 2008-01-02
>> 4   58 2009-09-04
>> 7  323 2005-11-11
>> 8  323 2006-01-11
>> 9  323 2007-11-19
>> 10 323 2008-01-07
>> 11 323 2008-04-07
>> 12 323 2008-05-21
>> 13 323 2008-07-11
>> 14 547 2004-10-05
>> 15 794 2007-09-05
>> 16 814 2002-08-14
>> 17 814 2002-11-25
>> 18 814 2004-04-29
>> 19 814 2004-04-29
>> 20 814 2007-12-05
>> 21 814 2008-02-27
>> 31 910 1987-05-08
>> 32 910 2004-02-05
>> 33 910 2004-02-05
>> 34 910 2009-11-20
>> 35 910 2009-12-10
>> 36 910 2009-12-24
>> 37 999 2005-05-03
>> You can do surgery on fff function to see what result comes from some
>> piece of the function e.g.
>>
>> sapply(split(id.d, id.d[,1]), testlast)
>>
>> Regards
>> Petr
>>
>>> I have refined my test dataset, to include some tests (e.g. 910 has
>>> the same dup as 1019, but for 910 it's not the earliest date):
>>>
>>>
>>> ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
>>> ,547,794,814,814,814,814,814,814,841,841,841,841,841
>>> ,841,841,841,841,910,910,910,910,910,910,999,1019,1019
>>> ,1019)
>>>
>>> DATE <-
>>>   c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
>>>   ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
>>>   ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
>>>   ,20050421,20050421,20060428,20060602,20060816,20061025,20061129
>>>   ,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210
>>>   ,20091224,20050503,19870508,19870508,19880330)
>>>
>>> Correct output:
>>> "167"  "841"  "1019"
>>>
>>> Stuart
>>>
>>> -----Original Message-----
>>> From: PIKAL Petr [mailto:petr.pikal at precheza.cz]
>>> Sent: 23 October 2012 13:15
>>> To: Stuart Leask; r-help at r-project.org
>>> Subject: RE: [r] How to pick colums from a ragged array?
>>>
>>> Hi
>>>
>>> Rui's answer brought me to more elaborated solution which still
>>> needs data frame to be ordered by date
>>>
>>> fff<-function(data, first=TRUE, remove=FALSE) {
>>>
>>> testfirst <- function(x) x[1,2]==x[2,2]
>>> testlast <- function(x) x[length(x),2]==x[length(x)-1,2]
>>>
>>> if(first) sel <- as.numeric(names(which(sapply(split(data,
>>> data[,1]),
>>> testfirst)))) else sel <- as.numeric(names(which(sapply(split(data,
>>> data[,1]), testlast))))
>>>
>>> if (remove) data[data[,1]!=sel,] else data[data[,1]==sel,] }
>>>
>>>
>>>> fff(id.d)
>>>      ID     DATE
>>> 31 910 20091105
>>> 32 910 20091105
>>> 33 910 20091117
>>> 34 910 20091119
>>> 35 910 20091120
>>> 36 910 20091210
>>> 37 910 20091224
>>> 38 910 20091224
>>>
>>>> fff(id.d, remove=T)
>>>       ID     DATE
>>> 1    58 20060821
>>> 2    58 20061207
>>> 3    58 20080102
>>> 4    58 20090904
>>> 5   167 20040205
>>> 6   167 20040323
>>> 7   323 20051111
>>> 8   323 20060111
>>> 9   323 20071119
>>> 10  323 20080107
>>> 11  323 20080407
>>> 12  323 20080521
>>> 13  323 20080711
>>> 14  547 20041005
>>> 15  794 20070905
>>> 16  814 20020814
>>> 17  814 20021125
>>> 18  814 20040429
>>> 19  814 20040429
>>> 20  814 20071205
>>> 21  814 20080227
>>> 22  841 20050421
>>> 23  841 20060130
>>> 24  841 20060428
>>> 25  841 20060602
>>> 26  841 20060816
>>> 27  841 20061025
>>> 28  841 20061129
>>> 29  841 20070112
>>> 30  841 20070514
>>> 39  999 20050503
>>> 40 1019 19870508
>>> 41 1019 19880223
>>> 42 1019 19880330
>>> 43 1019 19880330
>>> Regards
>>> Petr
>>>
>>>
>>>> -----Original Message-----
>>>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of PIKAL Petr
>>>> Sent: Tuesday, October 23, 2012 1:49 PM
>>>> To: Stuart Leask; r-help at r-project.org
>>>> Subject: Re: [R] [r] How to pick colums from a ragged array?
>>>>
>>>> Hi
>>>>
>>>> I did not check your code and rather followed your explanation. BTW,
>>>> thanks for test data.
>>>>
>>>> small change in data frame to make DATE as Date class
>>>>
>>>> datum<-as.Date(as.character(DATE), format="%Y%m%d")
>>>> id.d <- data.frame(ID,datum )
>>>>
>>>> ordering by date
>>>>
>>>> id.d<-id.d[order(id.d$datum),]
>>>>
>>>>
>>>> two functions to test if first two dates are the same or last two
>>>> dates are the same
>>>>
>>>> testfirst <- function(x) x[1,2]==x[2,2]
>>>> testlast <- function(x) x[length(x),2]==x[length(x)-1,2]
>>>>
>>>> change one last date in the data frame to be the same as previous
>>>>
>>>> id.d[35,2]<-id.d[36,2]
>>>>
>>>> and here are results
>>>>
>>>> sapply(split(id.d, id.d$ID), testlast)
>>>>     58   167   323   547   794   814   841   910   999  1019
>>>> FALSE FALSE FALSE    NA    NA FALSE FALSE  TRUE    NA FALSE
>>>>
>>>>> sapply(split(id.d, id.d$ID), testfirst)
>>>>     58   167   323   547   794   814   841   910   999  1019
>>>> FALSE FALSE FALSE    NA    NA FALSE FALSE FALSE    NA FALSE
>>>>
>>>> Now you can select the ID which is TRUE and remove it from your data
>>>>
>>>> which(sapply(split(id.d, id.d$ID), testlast))
>>>>
>>>> and use it for your data frame to subset/remove
>>>>
>>>> id.d$ID == as.numeric(names(which(sapply(split(id.d, id.d$ID), testlast))))
>>>>  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>>>> [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>>>> [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
>>>> [37]  TRUE  TRUE  TRUE  TRUE
>>>>
>>>> However I am not sure if this is exactly what you want.
>>>>
>>>> Regards
>>>> Petr
>>>>
>>>>> -----Original Message-----
>>>>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Stuart Leask
>>>>> Sent: Tuesday, October 23, 2012 11:38 AM
>>>>> To: r-help at r-project.org
>>>>> Subject: [R] [r] How to pick colums from a ragged array?
>>>>>
>>>>> I have a large dataset (~1 million rows) of three variables: ID
>>>>> (patient's name), DATE (of appointment) and DIAGNOSIS (given on that
>>>>> date).
>>>>> Patients may have been assigned more than one diagnosis at any one
>>>>> appointment - leading to two rows, same ID and DATE but
>>>>> different DIAGNOSIS.
>>>>> The diagnoses may change between appointments.
>>>>>
>>>>> I want to subset the data in two ways:
>>>>>
>>>>> -          define groups of patients by the first diagnosis given
>>>>>
>>>>> -          define groups of patients by the last diagnosis given.
>>>>>
>>>>> The problem:
>>>>> Unfortunately, a small number of patients have been given more
>>>>> than one diagnosis at their first (or last) appointment. These
>>>>> individuals I need to identify and remove, as it's not possible to
>>>>> say uniquely what their first (or last) diagnosis was. So I need
>>>>> to identify and remove these individuals which have pairs of
>>>>> rows with the same ID and (lowest or highest) DATE. The size of
>>>>> the dataset precludes the option of doing this by eye.
>>>>>
>>>>> I suspect there is a very elegant way of doing this in R.
>>>>>
>>>>> This is what I've come up with:
>>>>>
>>>>>
>>>>> -          Sort by DATE then ID
>>>>>
>>>>> -          Make a ragged array of DATE by ID
>>>>>
>>>>> -          Remove IDs that only occur once.
>>>>>
>>>>> -          Subtract the first and second DATEs. Remove IDs for which
>>>>> this = zero, as this will only be true for IDs for which the
>>>>> appointment is recorded twice (because there were two diagnoses
>>>>> recorded on this date).
>>>>>
>>>>> -          (Then do the same to get the 'last appointment' duplicates,
>>>>> by reversing the initial sort by DATE.)
>>>>>
>>>>> I am stuck at the 'Subtract dates' step: I would like to get the
>>>>> data out of the ragged array by columns (so e.g. I end up with a
>>>>> matrix of ID, 1st DATE, 2nd DATE). But I can't get the dates out
>>>>> by column from the ragged array.
>>>>>
>>>>> I hope someone can help. My ugly code is below, with some data for
>>>>> testing.
>>>>>
>>>>>
>>>>> Stuart
>>>>>
>>>>>
>>>>> Dr Stuart John Leask DM FRCPsych MB BChir MA
>>>>> Clinical Senior Lecturer and Honorary Consultant Psychiatrist
>>>>> Institute of Mental Health, Innovation Park
>>>>> Triumph Road, Nottingham, Notts. NG7 2TU. UK
>>>>> Tel. +44 115 82 30419
>>>>>
>> stuart.leask at nottingham.ac.uk<mailto:stuart.leask at nottingham.ac.uk
>>>>> Google 'Dr Stuart Leask'
>>>>>
>>>>>
>>>>> ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
>>>>> ,547,794,814,814,814,814,814,814,841,841,841,841,841
>>>>> ,841,841,841,841,910,910,910,910,910,910,999,1019,1019
>>>>> ,1019)
>>>>>
>>>>> DATE <-
>>>>> c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
>>>>> ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
>>>>> ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
>>>>> ,20050421,20060130,20060428,20060602,20060816,20061025,20061129
>>>>> ,20070112,20070514,20091105,20091117,20091119,20091120,20091210
>>>>> ,20091224,20050503,19870508,19880223,19880330)
>>>>>
>>>>> id.d <- cbind (ID,DATE )
>>>>> rag.a  <-  split ( id.d [ ,2 ], id.d [ ,1])               # create ragged array, 1-n DATES for every NAME
>>>>>
>>>>> # Inelegant attempt to remove IDs that only have one entry:
>>>>>
>>>>> rag.s <-tapply  (id.d [ ,2], id.d [ ,1], sum)             # add up the dates per row
>>>>> # Since DATE is in 'year mo da', if there's only one date, sum will be
>>>>> # less than 21000000:
>>>>> rag.t <- rag.s [ rag.s > 21000000 ]
>>>>> multi.dates <- rownames ( rag.t )                         # all the IDs with >1 date
>>>>> rag.am <- rag.a [ multi.dates ]                           # rag.am only has IDs with > 1 Date
>>>>>
>>>>>
>>>>> # But now I'm stuck.
>>>>> # Each row of the array is rag.am$ID.
>>>>> # So I can't pick columns of DATEs from the ragged array.
>>>>>
>>>>> ______________________________________________
>>>>> R-help at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.