[R] [r] How to pick colums from a ragged array?
Stuart Leask
Stuart.Leask at nottingham.ac.uk
Tue Oct 23 13:37:30 CEST 2012
Thanks Rui - your initial, very elegant suggestion, has spurred me on!
1. As you noticed, my example data had no examples of duplicate first dates (DOH!)
I have corrected this, and added a test - an ID that has a duplicate which is not the earliest DATE, but is the same DATE an earliest/duplicate for another ID.
2. Your suggestion gave me all the duplicates:
how.many <- ave ( id.d [ ,1], id.d [,1], id.d [,2], FUN = length)
nd.b<- id.d [ how.many > 1, ]
3. I can then simply make a table of earliest DATEs by ID, and then see which DATEs in this table are shared:
earliest <- tapply ( DATE, ID, min)
rownames(earliest[earliest%in%nd.b])
This seems to work - and it does seem exclude IDs which have a duplicate date which is the same as a minimum date for another ID.
I'm trying to work out why!
Many, many thanks for the gift of that function. I will compare the two approaches (and assume that mine is flawed!).
Stuart
************************************************
ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,999,1019,1019
,1019)
DATE <-
c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20050421,20060428,20060602,20060816,20061025,20061129
,20070112,20070514, 19870409,19870508,19870508, 20091120,20091210
,20091224,20050503,19870508,19870508,19880330)
id.d <- cbind (ID,DATE )
how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length)
nd.b<- id.d[how.many > 1, ]
earliest <- tapply ( DATE, ID, min) # table of earliest DATEs
rownames (earliest [earliest %in% nd.b ] ) # IDs of duplicates at the earliest date for that individual. I think...
******************************************************************
-----Original Message-----
From: Rui Barradas [mailto:ruipbarradas at sapo.pt]
Sent: 23 October 2012 12:21
To: Stuart Leask
Cc: r-help at r-project.org
Subject: Re: [R] [r] How to pick colums from a ragged array?
Hello,
Thinking again, if you just want the first/last in each ID that repeats the DATE, the following function does the job. Since there were no such cases in your data example, I've added 3 rows to the dataset.
ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
,547,794,814,814,814,814,814,814,841,841,841,841,841
,841,841,841,841,910,910,910,910,910,910,910,910,999,1019,1019
,1019,1019)
DATE <- c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
,20060111,20071119,20080107,20080407,20080521,20080711,20041005
,20070905,20020814,20021125,20040429,20040429,20071205,20080227
,20050421,20060130,20060428,20060602,20060816,20061025,20061129
,20070112,20070514,20091105,20091105,20091117,20091119,20091120,20091210
,20091224,20091224,20050503,19870508,19880223,19880330,19880330)
id.d <- cbind(ID, DATE)
getRepeat <- function(x, first = TRUE){
fun <- if(first) head else tail
sp <- split(data.frame(x), x[,1])
first.date <- tapply(x[,2], x[,1], FUN = fun, 1)
lst <- lapply(seq_along(sp), function(j) sp[[j]][,2] == first.date[j])
n <- unlist(lapply(lst, sum))
sp1 <- sp[n > 1]
i1 <- lst[n > 1]
lapply(seq_along(sp1), function(j) sp1[[j]][i1[[j]], ]) }
getRepeat(id.d) # defaults to first = TRUE getRepeat(id.d, first = FALSE) # to get the last ones
Hope this helps,
Rui Barradas
Em 23-10-2012 10:59, Rui Barradas escreveu:
> Hello,
>
> I'm not sure I understand it well, in the solution below the only
> returned value is ID == 814 but it's not the first nor the last DATE.
>
> how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length)
> id.d[how.many > 1, ]
>
> See the help page for ?ave if the repetition of id.d[,1] is confusing.
> The first is the vector to average (to apply FUN to) and the second is
> one of thw two vectors defining the groups.
>
> Hope this helps,
>
> Rui Barradas
> Em 23-10-2012 10:37, Stuart Leask escreveu:
>> I have a large dataset (~1 million rows) of three variables: ID
>> (patient's name), DATE (of appointment) and DIAGNOSIS (given on that
>> date).
>> Patients may have been assigned more than one diagnosis at any one
>> appointment - leading to two rows, same ID and DATE but different
>> DIAGNOSIS.
>> The diagnoses may change between appointments.
>>
>> I want to subset the data in two ways:
>>
>> - define groups of patients by the first diagnosis given
>>
>> - define groups of patients by the last diagnosis given.
>>
>> The problem:
>> Unfortunately, a small number of patients have been given more than
>> one diagnosis at their first (or last) appointment. These individuals
>> I need to identify and remove, as it's not possible to say uniquely
>> what their first (or last) diagnosis was. So I need to identify and
>> remove these individuals which have pairs of rows with the same ID
>> and (lowest or highest) DATE. The size of the dataset precludes the
>> option of doing this by eye.
>>
>> I suspect there is a very elegant way of doing this in R.
>>
>> This is what I've come up with:
>>
>>
>> - Sort by DATE then ID
>>
>> - Make a ragged array of DATE by ID
>>
>> - Remove IDs that only occur once.
>>
>> - Subtract the first and second DATEs. Remove IDs for which
>> this = zero, as this will only be true for IDs for which the
>> appointment is recorded twice (because there were two diagnoses
>> recorded on this date).
>>
>> - (Then do the same to get the 'last appointment'
>> duplicates, by reversing the initial sort by DATE.)
>>
>> I am stuck at the 'Subtract dates' step: I would like to get the data
>> out of the ragged array by columns (so e.g. I end up with a matrix of
>> ID, 1st DATE, 2nd DATE). But I can't get the dates out by column from
>> the ragged array.
>>
>> I hope someone can help. My ugly code is below, with some data for
>> testing.
>>
>>
>> Stuart
>>
>>
>> Dr Stuart John Leask DM FRCPsych MB BChir MA Clinical Senior Lecturer
>> and Honorary Consultant Pychiatrist Institute of Mental Health,
>> Innovation Park Triumph Road, Nottingham, Notts. NG7 2TU. UK Tel. +44
>> 115 82 30419
>> stuart.leask at nottingham.ac.uk<mailto:stuart.leask at nottingham.ac.uk>
>> Google 'Dr Stuart Leask'
>>
>>
>> ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
>> ,547,794,814,814,814,814,814,814,841,841,841,841,841
>> ,841,841,841,841,910,910,910,910,910,910,999,1019,1019
>> ,1019)
>>
>> DATE <-
>> c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
>> ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
>> ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
>> ,20050421,20060130,20060428,20060602,20060816,20061025,20061129
>> ,20070112,20070514,20091105,20091117,20091119,20091120,20091210
>> ,20091224,20050503,19870508,19880223,19880330)
>>
>> id.d <- cbind (ID,DATE )
>> rag.a <- split ( id.d [ ,2 ], id.d [ ,1]) # create
>> ragged array, 1-n DATES for every NAME
>>
>> # Inelegant attempt to remove IDs that only have one entry:
>>
>> rag.s <-tapply (id.d [ ,2], id.d [ ,1], sum) #add up the dates per
>> row # Since DATE is in 'year mo da', if there's only one date, sum
>> will be less than 2100000:
>> rag.t <- rag.s [ rag.s > 21000000 ]
>> multi.dates <- rownames ( rag.t ) # all the
>> IDs with >1 date
>> rag.am <- rag.a [ multi.dates ] # rag.am
>> only has IDs with > 1 Date
>>
>>
>> # But now I'm stuck.
>> # Each row of the array is rag.am$ID.
>> # So I can't pick columns of DATEs from the ragged array.
>>
>> This message and any attachment are intended solely for the addressee
>> and may contain confidential information. If you have received this
>> message in error, please send it back to me, and immediately delete
>> it. Please do not use, copy or disclose the information contained
>> in this message or in any attachment. Any views or opinions
>> expressed by the author of this email do not necessarily reflect the
>> views of the University of Nottingham.
>>
>> This message has been checked for viruses but the contents of an
>> attachment may still contain software viruses which could damage your
>> computer
>> system:
>> you are advised to perform your own checks. Email communications with
>> the University of Nottingham may be monitored as permitted by UK
>> legislation.
>> [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
More information about the R-help
mailing list