[R] FW: [r] How to pick colums from a ragged array?
Rui Barradas
ruipbarradas at sapo.pt
Tue Oct 23 14:42:14 CEST 2012
Hello,
You're right, getRepeat returns a list of data.frames, one per each ID.
To put them all in the same df use
do.call(rbind, g.r)
Rui Barradas
Em 23-10-2012 13:36, Stuart Leask escreveu:
> Sorry, I must be a bit thick.!
> getRepeat gives me the data with duplicates - but I don't seem to be able to manipulate the result. It looks like a list of dataframes:
>
>> g.r<-getRepeat(id.d)
>> dim(g.r)
> NULL
>> summary(g.r)
> Length Class Mode
> [1,] 2 data.frame list
> [2,] 2 data.frame list
> [3,] 2 data.frame list
>
> This leaves me with the same problem I had with my ragged array i.e. how do I put all the second elements from this long list of data frames, into a single list?
>
> I need to end up with a list of all the IDs that have duplicate first (or last) DATES.
>
> Stuart
>
> -----Original Message-----
> From: Leask Stuart
> Sent: 23 October 2012 13:17
> To: 'Rui Barradas'
> Cc: 'r-help at r-project.org'
> Subject: RE: [R] [r] How to pick colums from a ragged array?
>
> Ah, no, my method does fail.
> Consider an ID that has a duplicate DATE that isn't the first date, but it's first date is the same as another ID's first date that IS a duplicate.
> Test data is all - see below it failing.
>
> So, I remain very grateful for your function!
>
> Stuart
>
>
> ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
> ,547,794,814,814,814,814,814,814,841,841,841,841,841
> ,841,841,841,841,910,910,910,910,910,910,999,1019,1019
> ,1019)
>
> DATE <-
> c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
> ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
> ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
> ,20050421,20050421,20060428,20060602,20060816,20061025,20061129
> ,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210
> ,20091224,20050503,19870508,19870508,19880330)
>
> id.d <- cbind (ID,DATE )
> # rag.a <- split ( id.d [ ,2 ], id.d [ ,1]) # create ragged array, 1-n DATES for every NAME
> # Inelegant attempt to remove IDs that only have one entry:
> # rag.s <-tapply (id.d [ ,2], id.d [ ,1], sum) #add up the dates per row
> # Since DATE is in 'year mo da', if there's only one date, sum will be less than 2100000:
> # rag.t <- rag.s [ rag.s > 21000000 ]
> # multi.dates <- rownames ( rag.t ) # all the IDs with >1 date
> # rag.am <- rag.a [ multi.dates ] # rag.am only has IDs with > 1 Date
>
> how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length)
> nd.b<- id.d[how.many > 1, ]
>
> #ni<-dim(nd.b)[1]
> #nd.IDs<-nd.b[1:(ni/2)*2,1] # list of IDs with dups
> #nd.DATEs<-nd.b[1:(ni/2)*2,2] # list of dup'd dates
>
> earliest<-tapply(DATE,ID,min) # table of mins
> rownames(earliest[earliest%in%nd.b]) # IDs of dups with min
> # This suggests ID 910 has a duplicate earliest, and it doesn't - it has a non-earliest duplicate, # and an earliest date that is the same as another ID's earliest+duplicate.
>
>
> -----Original Message-----
> From: Leask Stuart
> Sent: 23 October 2012 12:38
> To: 'Rui Barradas'
> Cc: r-help at r-project.org
> Subject: RE: [R] [r] How to pick colums from a ragged array?
>
> Thanks Rui - your initial, very elegant suggestion, has spurred me on!
>
> 1. As you noticed, my example data had no examples of duplicate first dates (DOH!) I have corrected this, and added a test - an ID that has a duplicate which is not the earliest DATE, but is the same DATE an earliest/duplicate for another ID.
>
> 2. Your suggestion gave me all the duplicates:
>
> how.many <- ave ( id.d [ ,1], id.d [,1], id.d [,2], FUN = length)
> nd.b<- id.d [ how.many > 1, ]
>
> 3. I can then simply make a table of earliest DATEs by ID, and then see which DATEs in this table are shared:
>
> earliest <- tapply ( DATE, ID, min)
> rownames(earliest[earliest%in%nd.b])
>
> This seems to work - and it does seem exclude IDs which have a duplicate date which is the same as a minimum date for another ID.
> I'm trying to work out why!
>
>
> Many, many thanks for the gift of that function. I will compare the two approaches (and assume that mine is flawed!).
>
>
> Stuart
>
>
> ************************************************
>
> ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
> ,547,794,814,814,814,814,814,814,841,841,841,841,841
> ,841,841,841,841,910,910,910,910,910,910,999,1019,1019
> ,1019)
>
> DATE <-
> c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
> ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
> ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
> ,20050421,20050421,20060428,20060602,20060816,20061025,20061129
> ,20070112,20070514, 19870409,19870508,19870508, 20091120,20091210
> ,20091224,20050503,19870508,19870508,19880330)
>
> id.d <- cbind (ID,DATE )
>
> how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length)
> nd.b<- id.d[how.many > 1, ]
>
> earliest <- tapply ( DATE, ID, min) # table of earliest DATEs
> rownames (earliest [earliest %in% nd.b ] ) # IDs of duplicates at the earliest date for that individual. I think...
>
>
>
>
> ******************************************************************
>
>
>
> -----Original Message-----
> From: Rui Barradas [mailto:ruipbarradas at sapo.pt]
> Sent: 23 October 2012 12:21
> To: Stuart Leask
> Cc: r-help at r-project.org
> Subject: Re: [R] [r] How to pick colums from a ragged array?
>
> Hello,
>
> Thinking again, if you just want the first/last in each ID that repeats the DATE, the following function does the job. Since there were no such cases in your data example, I've added 3 rows to the dataset.
>
> ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
> ,547,794,814,814,814,814,814,814,841,841,841,841,841
> ,841,841,841,841,910,910,910,910,910,910,910,910,999,1019,1019
> ,1019,1019)
>
> DATE <- c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
> ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
> ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
> ,20050421,20060130,20060428,20060602,20060816,20061025,20061129
> ,20070112,20070514,20091105,20091105,20091117,20091119,20091120,20091210
> ,20091224,20091224,20050503,19870508,19880223,19880330,19880330)
>
> id.d <- cbind(ID, DATE)
>
>
> getRepeat <- function(x, first = TRUE){
> fun <- if(first) head else tail
> sp <- split(data.frame(x), x[,1])
> first.date <- tapply(x[,2], x[,1], FUN = fun, 1)
> lst <- lapply(seq_along(sp), function(j) sp[[j]][,2] == first.date[j])
> n <- unlist(lapply(lst, sum))
> sp1 <- sp[n > 1]
> i1 <- lst[n > 1]
> lapply(seq_along(sp1), function(j) sp1[[j]][i1[[j]], ]) }
>
> getRepeat(id.d) # defaults to first = TRUE getRepeat(id.d, first = FALSE) # to get the last ones
>
>
> Hope this helps,
>
> Rui Barradas
>
>
> Em 23-10-2012 10:59, Rui Barradas escreveu:
>> Hello,
>>
>> I'm not sure I understand it well, in the solution below the only
>> returned value is ID == 814 but it's not the first nor the last DATE.
>>
>> how.many <- ave(id.d[,1], id.d[,1], id.d[,2], FUN = length)
>> id.d[how.many > 1, ]
>>
>> See the help page for ?ave if the repetition of id.d[,1] is confusing.
>> The first is the vector to average (to apply FUN to) and the second is
>> one of thw two vectors defining the groups.
>>
>> Hope this helps,
>>
>> Rui Barradas
>> Em 23-10-2012 10:37, Stuart Leask escreveu:
>>> I have a large dataset (~1 million rows) of three variables: ID
>>> (patient's name), DATE (of appointment) and DIAGNOSIS (given on that
>>> date).
>>> Patients may have been assigned more than one diagnosis at any one
>>> appointment - leading to two rows, same ID and DATE but different
>>> DIAGNOSIS.
>>> The diagnoses may change between appointments.
>>>
>>> I want to subset the data in two ways:
>>>
>>> - define groups of patients by the first diagnosis given
>>>
>>> - define groups of patients by the last diagnosis given.
>>>
>>> The problem:
>>> Unfortunately, a small number of patients have been given more than
>>> one diagnosis at their first (or last) appointment. These individuals
>>> I need to identify and remove, as it's not possible to say uniquely
>>> what their first (or last) diagnosis was. So I need to identify and
>>> remove these individuals which have pairs of rows with the same ID
>>> and (lowest or highest) DATE. The size of the dataset precludes the
>>> option of doing this by eye.
>>>
>>> I suspect there is a very elegant way of doing this in R.
>>>
>>> This is what I've come up with:
>>>
>>>
>>> - Sort by DATE then ID
>>>
>>> - Make a ragged array of DATE by ID
>>>
>>> - Remove IDs that only occur once.
>>>
>>> - Subtract the first and second DATEs. Remove IDs for which
>>> this = zero, as this will only be true for IDs for which the
>>> appointment is recorded twice (because there were two diagnoses
>>> recorded on this date).
>>>
>>> - (Then do the same to get the 'last appointment'
>>> duplicates, by reversing the initial sort by DATE.)
>>>
>>> I am stuck at the 'Subtract dates' step: I would like to get the data
>>> out of the ragged array by columns (so e.g. I end up with a matrix of
>>> ID, 1st DATE, 2nd DATE). But I can't get the dates out by column from
>>> the ragged array.
>>>
>>> I hope someone can help. My ugly code is below, with some data for
>>> testing.
>>>
>>>
>>> Stuart
>>>
>>>
>>> Dr Stuart John Leask DM FRCPsych MB BChir MA Clinical Senior Lecturer
>>> and Honorary Consultant Pychiatrist Institute of Mental Health,
>>> Innovation Park Triumph Road, Nottingham, Notts. NG7 2TU. UK Tel. +44
>>> 115 82 30419
>>> stuart.leask at nottingham.ac.uk<mailto:stuart.leask at nottingham.ac.uk>
>>> Google 'Dr Stuart Leask'
>>>
>>>
>>> ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
>>> ,547,794,814,814,814,814,814,814,841,841,841,841,841
>>> ,841,841,841,841,910,910,910,910,910,910,999,1019,1019
>>> ,1019)
>>>
>>> DATE <-
>>> c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
>>> ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
>>> ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
>>> ,20050421,20060130,20060428,20060602,20060816,20061025,20061129
>>> ,20070112,20070514,20091105,20091117,20091119,20091120,20091210
>>> ,20091224,20050503,19870508,19880223,19880330)
>>>
>>> id.d <- cbind (ID,DATE )
>>> rag.a <- split ( id.d [ ,2 ], id.d [ ,1]) # create
>>> ragged array, 1-n DATES for every NAME
>>>
>>> # Inelegant attempt to remove IDs that only have one entry:
>>>
>>> rag.s <-tapply (id.d [ ,2], id.d [ ,1], sum) #add up the dates per
>>> row # Since DATE is in 'year mo da', if there's only one date, sum
>>> will be less than 2100000:
>>> rag.t <- rag.s [ rag.s > 21000000 ]
>>> multi.dates <- rownames ( rag.t ) # all the
>>> IDs with >1 date
>>> rag.am <- rag.a [ multi.dates ] # rag.am
>>> only has IDs with > 1 Date
>>>
>>>
>>> # But now I'm stuck.
>>> # Each row of the array is rag.am$ID.
>>> # So I can't pick columns of DATEs from the ragged array.
>>>
>>> This message and any attachment are intended solely for the addressee
>>> and may contain confidential information. If you have received this
>>> message in error, please send it back to me, and immediately delete
>>> it. Please do not use, copy or disclose the information contained
>>> in this message or in any attachment. Any views or opinions
>>> expressed by the author of this email do not necessarily reflect the
>>> views of the University of Nottingham.
>>>
>>> This message has been checked for viruses but the contents of an
>>> attachment may still contain software viruses which could damage your
>>> computer
>>> system:
>>> you are advised to perform your own checks. Email communications with
>>> the University of Nottingham may be monitored as permitted by UK
>>> legislation.
>>> [[alternative HTML version deleted]]
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
More information about the R-help
mailing list