[R] [r] How to pick colums from a ragged array?

Tue Oct 23 13:37:30 CEST 2012

Thanks Rui - your initial, very elegant suggestion, has spurred me on!

1. As you noticed, my example data had no examples of duplicate first dates (DOH!) 
I have corrected this, and added a test - an ID that has a duplicate which is not the earliest DATE, but is the same DATE an earliest/duplicate for another ID.

2. Your suggestion gave me all the duplicates:

how.many  <-  ave ( id.d [ ,1], id.d [,1], id.d [,2], FUN = length)
nd.b<- id.d [ how.many  > 1,  ]

3. I can then simply make a table of earliest DATEs by ID, and then see which DATEs in this table are shared:

earliest <- tapply ( DATE, ID, min)                
rownames(earliest[earliest%in%nd.b])   

This seems to work - and it does seem exclude IDs which have a duplicate date which is the same as a minimum date for another ID.
I'm trying to work out why!

Many, many thanks for the gift of that function. I will compare the two approaches (and assume that mine is flawed!).

Stuart

************************************************

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)

DATE <- 
 c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
 ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
 ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
 ,20050421,20050421,20060428,20060602,20060816,20061025,20061129
 ,20070112,20070514, 19870409,19870508,19870508, 20091120,20091210
 ,20091224,20050503,19870508,19870508,19880330)

 id.d <- cbind (ID,DATE )

how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length)
nd.b<- id.d[how.many > 1, ]

earliest <- tapply  ( DATE, ID, min)                    # table of earliest DATEs
rownames (earliest [earliest %in% nd.b ] )   # IDs of duplicates at the earliest date for that individual. I think...

******************************************************************

-----Original Message-----
From: Rui Barradas [mailto:ruipbarradas at sapo.pt] 
Sent: 23 October 2012 12:21
To: Stuart Leask
Cc: r-help at r-project.org
Subject: Re: [R] [r] How to pick colums from a ragged array?

Hello,

Thinking again, if you just want the first/last in each ID that repeats the DATE, the following function does the job. Since there were no such cases in your data example, I've added 3 rows to the dataset.

ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,910,910,999,1019,1019
,1019,1019)

DATE <- c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20060130,20060428,20060602,20060816,20061025,20061129
,20070112,20070514,20091105,20091105,20091117,20091119,20091120,20091210
,20091224,20091224,20050503,19870508,19880223,19880330,19880330)

id.d <- cbind(ID, DATE)

getRepeat <- function(x, first = TRUE){
     fun <- if(first) head else tail
     sp <- split(data.frame(x), x[,1])
     first.date <- tapply(x[,2], x[,1], FUN = fun, 1)
     lst <- lapply(seq_along(sp), function(j) sp[[j]][,2] == first.date[j])
     n <- unlist(lapply(lst, sum))
     sp1 <- sp[n > 1]
     i1 <- lst[n > 1]
     lapply(seq_along(sp1), function(j) sp1[[j]][i1[[j]], ]) }

getRepeat(id.d)  # defaults to first = TRUE getRepeat(id.d, first = FALSE)  # to get the last ones

Hope this helps,

Rui Barradas

Em 23-10-2012 10:59, Rui Barradas escreveu:
> Hello,
>
> I'm not sure I understand it well, in the solution below the only 
> returned value is ID == 814 but it's not the first nor the last DATE.
>
> how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length) 
> id.d[how.many > 1, ]
>
> See the help page for ?ave if the repetition of id.d[,1] is confusing. 
> The first is the vector to average (to apply FUN to) and the second is 
> one of thw two vectors defining the groups.
>
> Hope this helps,
>
> Rui Barradas
> Em 23-10-2012 10:37, Stuart Leask escreveu:
>> I have a large dataset (~1 million rows) of three variables: ID 
>> (patient's name), DATE (of appointment) and DIAGNOSIS (given on that 
>> date).
>> Patients may have been assigned more than one diagnosis at any one 
>> appointment - leading to two rows, same ID and DATE but different 
>> DIAGNOSIS.
>> The diagnoses may change between appointments.
>>
>> I want to subset the data in two ways:
>>
>> -          define groups of patients by the first diagnosis given
>>
>> -          define groups of patients by the last diagnosis given.
>>
>> The problem:
>> Unfortunately, a small number of patients have been given more than 
>> one diagnosis at their first (or last) appointment. These individuals 
>> I need to identify and remove, as it's not possible to say uniquely 
>> what their first (or last) diagnosis was. So I need to identify and 
>> remove these individuals which have pairs of rows with the same ID 
>> and (lowest or highest) DATE. The size of the dataset precludes the 
>> option of doing this by eye.
>>
>> I suspect there is a very elegant way of doing this in R.
>>
>> This is what I've come up with:
>>
>>
>> -          Sort by DATE then ID
>>
>> -          Make a ragged array of DATE by ID
>>
>> -          Remove IDs that only occur once.
>>
>> -          Subtract the first and second DATEs. Remove IDs for which 
>> this = zero, as this will only be true for IDs for which the 
>> appointment is recorded twice (because there were two diagnoses 
>> recorded on this date).
>>
>> -          (Then do the same to get the 'last appointment' 
>> duplicates, by reversing the initial sort by DATE.)
>>
>> I am stuck at the 'Subtract dates' step: I would like to get the data 
>> out of the ragged array by columns (so e.g. I end up with a matrix of 
>> ID, 1st DATE, 2nd DATE). But I can't get the dates out by column from 
>> the ragged array.
>>
>> I hope someone can help. My ugly code is below, with some data for 
>> testing.
>>
>>
>> Stuart
>>
>>
>> Dr Stuart John Leask DM FRCPsych MB BChir MA Clinical Senior Lecturer 
>> and Honorary Consultant Pychiatrist Institute of Mental Health, 
>> Innovation Park Triumph Road, Nottingham, Notts. NG7 2TU. UK Tel. +44 
>> 115 82 30419 
>> stuart.leask at nottingham.ac.uk<mailto:stuart.leask at nottingham.ac.uk>
>> Google 'Dr Stuart Leask'
>>
>>
>> ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
>> ,547,794,814,814,814,814,814,814,841,841,841,841,841
>> ,841,841,841,841,910,910,910,910,910,910,999,1019,1019
>> ,1019)
>>
>> DATE <- 
>> c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
>> ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
>> ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
>> ,20050421,20060130,20060428,20060602,20060816,20061025,20061129
>> ,20070112,20070514,20091105,20091117,20091119,20091120,20091210
>> ,20091224,20050503,19870508,19880223,19880330)
>>
>> id.d <- cbind (ID,DATE )
>> rag.a  <-  split ( id.d [ ,2 ], id.d [ ,1])               # create 
>> ragged array, 1-n DATES for every NAME
>>
>> # Inelegant attempt to remove IDs that only have one entry:
>>
>> rag.s <-tapply  (id.d [ ,2], id.d [ ,1], sum) #add up the dates per 
>> row # Since DATE is in 'year mo da', if there's only one date, sum 
>> will be less than 2100000:
>> rag.t <- rag.s [ rag.s > 21000000 ]
>> multi.dates <- rownames ( rag.t )                         # all the 
>> IDs with >1 date
>> rag.am <- rag.a [ multi.dates ]                           # rag.am 
>> only has IDs with > 1 Date
>>
>>
>> # But now I'm stuck.
>> # Each row of the array is rag.am$ID.
>> # So I can't pick columns of DATEs from the ragged array.
>>
>> This message and any attachment are intended solely for the addressee 
>> and may contain confidential information. If you have received this 
>> message in error, please send it back to me, and immediately delete
>> it.   Please do not use, copy or disclose the information contained 
>> in this message or in any attachment.  Any views or opinions 
>> expressed by the author of this email do not necessarily reflect the 
>> views of the University of Nottingham.
>>
>> This message has been checked for viruses but the contents of an 
>> attachment may still contain software viruses which could damage your 
>> computer
>> system:
>> you are advised to perform your own checks. Email communications with 
>> the University of Nottingham may be monitored as permitted by UK 
>> legislation.
>>     [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.