[R] [r] How to pick colums from a ragged array?
Rui Barradas
ruipbarradas at sapo.pt
Wed Oct 24 20:42:22 CEST 2012
Hello,
I just realized that function getRepLogical marks the second row of a
duplicated pair, not the first (or the last, when counting from the end),
to be removed. The first tapply should be

dte <- tapply(x[,2], x[,1], FUN = function(x) duplicated(fun(x, 2),
    fromLast = TRUE))

in order to remove the first (or last).
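
For anyone following along, here is a small illustration of what fromLast changes. The two-element vectors are made up for the example (they mimic one duplicated first date), and the names flag, same and mixed are only for illustration. The last lines are an observation rather than something stated in this thread: when the date test uses fromLast = TRUE, the DG test apparently needs it too, otherwise the two flags do not refer to the same row.

```r
dates <- c(20050421, 20050421)   # a duplicated first date
dg    <- c(1, 2)                 # two different diagnoses on that date

duplicated(dates)                    # flags the second row of the pair
duplicated(dates, fromLast = TRUE)   # flags the first row of the pair

# Combine the date test with the DG test, in the spirit of i1 & i2
flag <- duplicated(dates, fromLast = TRUE) & !duplicated(dg, fromLast = TRUE)

# Same diagnosis twice: nothing should be flagged
same <- duplicated(dates, fromLast = TRUE) & !duplicated(c(4, 4), fromLast = TRUE)

# fromLast on the dates only: the first row is flagged even though
# the two diagnoses are identical
mixed <- duplicated(dates, fromLast = TRUE) & !duplicated(c(4, 4))
```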
Rui Barradas
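
A different route to the INCLUDE variable, offered only as a sketch: instead of looking at the first (or last) two rows, test per ID whether the earliest (or latest) DATE carries more than one distinct DG, and then drop the whole ID. The helper name ambiguousIDs and the column names INCLUDE.FIRST and INCLUDE.LAST are hypothetical, not functions or variables from this thread, and the sketch assumes that an ambiguous ID should be excluded entirely rather than row by row. It uses min and max, so it does not depend on the sort order.

```r
# Test data as posted by Stuart on 24 October 2012
ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323,
        547,794,814,814,814,814,814,814,841,841,841,841,841,
        841,841,841,841,910,910,910,910,910,910,999,1019,1019,
        1019)
DATE <- c(20060821,20061207,20080102,20090904,20040205,20040205,20051111,
          20060111,20071119,20080107,20080407,20080521,20080521,20041005,
          20070905,20020814,20021125,20040429,20040429,20071205,20071205,
          20050421,20050421,20060428,20060602,20060816,20061025,20061129,
          20070112,20070514,19870508,20040205,20040205,20080521,20080521,
          20091224,20050503,19870508,19870508,19880330)
DG <- c(1,2,1,1,4,4,3,2,3,2,1,2,3,2,1,2,2,2,2,2,2,1,2,1,1,1,1,1,1,4,
        3,3,3,4,3,2,2,2,1,1)
id.d <- data.frame(ID, DATE, DG)

# ambiguousIDs() is a hypothetical helper, not one of the thread's functions.
# It returns the IDs whose first (or last) DATE has more than one distinct DG.
ambiguousIDs <- function(x, first = TRUE) {
    pick <- if (first) min else max
    flag <- sapply(split(x, x$ID), function(g) {
        edge <- g$DATE == pick(g$DATE)      # rows on the first (or last) date
        length(unique(g$DG[edge])) > 1      # more than one distinct diagnosis?
    })
    names(flag)[flag]
}

amb.first <- ambiguousIDs(id.d)                 # "841" "1019"
amb.last  <- ambiguousIDs(id.d, first = FALSE)  # "323"

# Row-level flags, the same length as the original data
id.d$INCLUDE.FIRST <- !(id.d$ID %in% as.numeric(amb.first))
id.d$INCLUDE.LAST  <- !(id.d$ID %in% as.numeric(amb.last))
```

With this test data, IDs 167 and 814 are kept, because their duplicated dates carry the same diagnosis.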
On 24-10-2012 18:41, Rui Barradas wrote:
> Hello,
>
> Using one of Arun's ideas from a few posts ago, this new function returns a
> logical index into id.d of the rows that should be _removed_, hence
> rm1 and rm2. I think
>
>
>
> getRepLogical <- function(x, first = TRUE){
>     fun <- if(first) head else tail
>     dte <- tapply(x[,2], x[,1], FUN = function(x) duplicated(fun(x, 2)))
>     len <- tapply(x[,2], x[,1], FUN = length)
>     lst <- lapply(seq_along(dte), function(i) c(dte[[i]], rep(FALSE,
>         if(len[[i]] > 2) len[[i]] - 2 else 0)))
>     lst <- if(first) lst else lapply(lst, rev)
>     i1 <- unlist(lst)
>     dg <- tapply(x[,3], x[,1], FUN = function(x) !duplicated(fun(x, 2)))
>     lst <- lapply(seq_along(dte), function(i) c(dg[[i]], rep(FALSE,
>         if(len[[i]] > 2) len[[i]] - 2 else 0)))
>     lst <- if(first) lst else lapply(lst, rev)
>     i2 <- unlist(lst)
>     i1 & i2
> }
>
> rm1 <- getRepLogical(id.d)
> rm2 <- getRepLogical(id.d, first = FALSE)
>
> id.d[rm1, ]
> id.d[rm2, ]
>
> id.d$INCLUDE <- !(rm1 | rm2)
>
>
> Hope this helps,
>
> Rui Barradas
> On 24-10-2012 16:41, Stuart Leask wrote:
>> (And, considering the real application, the functions ideally should
>> probably output a variable INCLUDE, the same length as the original
>> data, with TRUE and FALSE for whether or not that row should be
>> included...)
>>
>> -----Original Message-----
>> From: Leask Stuart
>> Sent: 24 October 2012 16:25
>> To: arun (smartpink111 at yahoo.com); 'PIKAL Petr'; Rui Barradas
>> (ruipbarradas at sapo.pt)
>> Subject: RE: [r] How to pick colums from a ragged array?
>>
>> Arun, Petr, Rui, many thanks for your help, and the functions you
>> have written.
>>
>> You'll recall I wanted to remove these first (or last) duplicates,
>> because they represented instances where two different diagnoses (in
>> this case, variable DG, value 1, 2, 3, 4 or 5) had been recorded on
>> the same day - so I can't say which was 'first' (or 'last').
>>
>> Your functions have revealed something I wasn't expecting: In some
>> cases, the diagnoses recorded on the duplicated DATEs are the same!
>> This is a surprise to me, but probably reflects someone going to two
>> different departments in a clinic, and both departments submit data.
>> I have to say that crazy things like this are often a feature of real
>> data, which I'm sure you've come across yourselves.
>>
>> Of course, I don't want to remove records in which I can determine an
>> unambiguous 'first diagnosis'.
>>
>> You have all put in so much effort on my behalf, I'm ashamed to ask,
>> but I wonder if any of the functions you've written could do this
>> with a little more indexing and the 'duplicated' function.
>> So the function should only exclude an ID if, having identified a first
>> (or last) DATE duplicate, the DGs for these two dates are different.
>>
>> Test dataset:
>>
>> ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
>> ,547,794,814,814,814,814,814,814,841,841,841,841,841
>> ,841,841,841,841,910,910,910,910,910,910,999,1019,1019
>> ,1019)
>>
>> DATE <-
>> c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
>> ,20060111,20071119,20080107,20080407,20080521,20080521,20041005
>> ,20070905,20020814,20021125,20040429,20040429,20071205,20071205
>> ,20050421,20050421,20060428,20060602,20060816,20061025,20061129
>> ,20070112,20070514, 19870508,20040205,20040205, 20080521,20080521
>> ,20091224,20050503,19870508,19870508,19880330)
>>
>> DG<-
>> c(1,2,1,1,4,4,3,2,3,2,1,2,3,2,1,2,2,2,2,2,2,1,2,1,1,1,1,1,1,4,3,3,3,4,3,2,2,2,1,1)
>>
>>
>> id.d<-data.frame(ID,DATE,DG)
>> id.d
>>
>> # Considering Rui's getRepeat function:
>>
>> g.r<-getRepeat(id.d) # defaults to first = TRUE; use
>> # getRepeat(id.d, first = FALSE) to get the last ones
>> g.rr<-do.call(rbind, g.r) # put the data into a matrix
>>
>> # I can remove the date duplicates with:
>> g.rr[rep(!duplicated(g.rr)[(1:(dim(g.rr)[1]/2))*2],each=2),]
>>
>> I'm not sure how to add this to your suggestions, Arun & Petr...
>>
>>
>> Stuart
>>
>>
>> -----Original Message-----
>> From: PIKAL Petr [mailto:petr.pikal at precheza.cz]
>> Sent: 23 October 2012 15:24
>> To: Stuart Leask
>> Subject: RE: [r] How to pick colums from a ragged array?
>>
>> Hi
>>
>> I assumed that id.d is a data frame
>>
>> id.d <- data.frame (ID,DATE )
>>
>> and
>>
>> fff(id.d)
>>
>> works for me
>>
>> Petr
>>
>>
>>> -----Original Message-----
>>> From: Stuart Leask [mailto:Stuart.Leask at nottingham.ac.uk]
>>> Sent: Tuesday, October 23, 2012 3:13 PM
>>> To: PIKAL Petr
>>> Subject: RE: [r] How to pick colums from a ragged array?
>>>
>>> Hi Petr.
>>> I see what you mean it should do, but when I run it I get an error
>>> (see below).
>>> Stuart
>>>
>>>
>>>> ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
>>> + ,547,794,814,814,814,814,814,814,841,841,841,841,841
>>> + ,841,841,841,841,910,910,910,910,910,910,999,1019,1019
>>> + ,1019)
>>>> DATE <-
>>> + c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
>>> + ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
>>> + ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
>>> + ,20050421,20050421,20060428,20060602,20060816,20061025,20061129
>>> + ,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210
>>> + ,20091224,20050503,19870508,19870508,19880330)
>>>> id.d <- cbind (ID,DATE )
>>>> fff<-function(data, first=TRUE, remove=FALSE) {
>>> +
>>> + testfirst <- function(x) x[1,2]==x[2,2]
>>> + testlast <- function(x) x[nrow(x),2]==x[nrow(x)-1,2]
>>> +
>>> + if(first) sel <- as.numeric(names(which(unlist(sapply(split(data,
>>> + data[,1]), testfirst))))) else sel <-
>>> + as.numeric(names(which(unlist(sapply(split(data, data[,1]),
>>> + testlast)))))
>>> +
>>> + if (remove) data[!data[,1] %in% sel,] else data[data[,1] %in% sel,]
>>> + }
>>>> fff(id.d)
>>> Error in x[1, 2] : incorrect number of dimensions
>>> -----Original Message-----
>>> From: PIKAL Petr [mailto:petr.pikal at precheza.cz]
>>> Sent: 23 October 2012 13:51
>>> To: Stuart Leask; r-help at r-project.org
>>> Subject: RE: [r] How to pick colums from a ragged array?
>>>
>>> Hi
>>>
>>>> -----Original Message-----
>>>> From: Stuart Leask [mailto:Stuart.Leask at nottingham.ac.uk]
>>>> Sent: Tuesday, October 23, 2012 2:29 PM
>>>> To: PIKAL Petr; r-help at r-project.org
>>>> Subject: RE: [r] How to pick colums from a ragged array?
>>>>
>>>> Hi there.
>>>>
>>>> Not sure I follow what you are doing.
>>>>
>>>> I want a list of all the IDs that have duplicate DATE entries, only
>>>> when the DATE is the earliest (or last) date for that ID.
>>> And that is what the function (with 3 small modifications) does
>>>
>>>
>>> fff<-function(data, first=TRUE, remove=FALSE) {
>>>
>>> testfirst <- function(x) x[1,2]==x[2,2]
>>> testlast <- function(x) x[nrow(x),2]==x[nrow(x)-1,2]
>>>
>>> if(first) sel <- as.numeric(names(which(unlist(sapply(split(data,
>>> data[,1]), testfirst))))) else sel <-
>>> as.numeric(names(which(unlist(sapply(split(data, data[,1]),
>>> testlast)))))
>>>
>>> if (remove) data[!data[,1] %in% sel,] else data[data[,1] %in% sel,] }
>>>
>>> See the result of your refined data
>>>
>>> fff(id.d)
>>> ID DATE
>>> 5 167 2004-02-05
>>> 6 167 2004-02-05
>>> 22 841 2005-04-21
>>> 23 841 2005-04-21
>>> 24 841 2006-04-28
>>> 25 841 2006-06-02
>>> 26 841 2006-08-16
>>> 27 841 2006-10-25
>>> 28 841 2006-11-29
>>> 29 841 2007-01-12
>>> 30 841 2007-05-14
>>> 38 1019 1987-05-08
>>> 39 1019 1987-05-08
>>> 40 1019 1988-03-30
>>>> fff(id.d, first=F)
>>> ID DATE
>>> 5 167 2004-02-05
>>> 6 167 2004-02-05
>>>> fff(id.d, remove=T)
>>> ID DATE
>>> 1 58 2006-08-21
>>> 2 58 2006-12-07
>>> 3 58 2008-01-02
>>> 4 58 2009-09-04
>>> 7 323 2005-11-11
>>> 8 323 2006-01-11
>>> 9 323 2007-11-19
>>> 10 323 2008-01-07
>>> 11 323 2008-04-07
>>> 12 323 2008-05-21
>>> 13 323 2008-07-11
>>> 14 547 2004-10-05
>>> 15 794 2007-09-05
>>> 16 814 2002-08-14
>>> 17 814 2002-11-25
>>> 18 814 2004-04-29
>>> 19 814 2004-04-29
>>> 20 814 2007-12-05
>>> 21 814 2008-02-27
>>> 31 910 1987-05-08
>>> 32 910 2004-02-05
>>> 33 910 2004-02-05
>>> 34 910 2009-11-20
>>> 35 910 2009-12-10
>>> 36 910 2009-12-24
>>> 37 999 2005-05-03
>>> You can do surgery on fff function to see what result comes from some
>>> piece of the function e.g.
>>>
>>> sapply(split(id.d, id.d[,1]), testlast)
>>>
>>> Regards
>>> Petr
>>>
>>>> I have refined my test dataset, to include some tests (e.g. 910 has
>>>> the same dup as 1019, but for 910 it's not the earliest date):
>>>>
>>>>
>>>> ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
>>>> ,547,794,814,814,814,814,814,814,841,841,841,841,841
>>>> ,841,841,841,841,910,910,910,910,910,910,999,1019,1019
>>>> ,1019)
>>>>
>>>> DATE <-
>>>> c(20060821,20061207,20080102,20090904,20040205,20040205,20051111
>>>> ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
>>>> ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
>>>> ,20050421,20050421,20060428,20060602,20060816,20061025,20061129
>>>> ,20070112,20070514, 19870508,20040205,20040205, 20091120,20091210
>>>> ,20091224,20050503,19870508,19870508,19880330)
>>>>
>>>> Correct output:
>>>> "167" "841" "1019"
>>>>
>>>> Stuart
>>>>
>>>> -----Original Message-----
>>>> From: PIKAL Petr [mailto:petr.pikal at precheza.cz]
>>>> Sent: 23 October 2012 13:15
>>>> To: Stuart Leask; r-help at r-project.org
>>>> Subject: RE: [r] How to pick colums from a ragged array?
>>>>
>>>> Hi
>>>>
>>>> Rui's answer brought me to a more elaborate solution, which still
>>>> needs the data frame to be ordered by date
>>>>
>>>> fff<-function(data, first=TRUE, remove=FALSE) {
>>>>
>>>> testfirst <- function(x) x[1,2]==x[2,2]
>>>> testlast <- function(x) x[length(x),2]==x[length(x)-1,2]
>>>>
>>>> if(first) sel <- as.numeric(names(which(sapply(split(data,
>>>> data[,1]),
>>>> testfirst)))) else sel <- as.numeric(names(which(sapply(split(data,
>>>> data[,1]), testlast))))
>>>>
>>>> if (remove) data[data[,1]!=sel,] else data[data[,1]==sel,] }
>>>>
>>>>
>>>>> fff(id.d)
>>>> ID DATE
>>>> 31 910 20091105
>>>> 32 910 20091105
>>>> 33 910 20091117
>>>> 34 910 20091119
>>>> 35 910 20091120
>>>> 36 910 20091210
>>>> 37 910 20091224
>>>> 38 910 20091224
>>>>
>>>>> fff(id.d, remove=T)
>>>> ID DATE
>>>> 1 58 20060821
>>>> 2 58 20061207
>>>> 3 58 20080102
>>>> 4 58 20090904
>>>> 5 167 20040205
>>>> 6 167 20040323
>>>> 7 323 20051111
>>>> 8 323 20060111
>>>> 9 323 20071119
>>>> 10 323 20080107
>>>> 11 323 20080407
>>>> 12 323 20080521
>>>> 13 323 20080711
>>>> 14 547 20041005
>>>> 15 794 20070905
>>>> 16 814 20020814
>>>> 17 814 20021125
>>>> 18 814 20040429
>>>> 19 814 20040429
>>>> 20 814 20071205
>>>> 21 814 20080227
>>>> 22 841 20050421
>>>> 23 841 20060130
>>>> 24 841 20060428
>>>> 25 841 20060602
>>>> 26 841 20060816
>>>> 27 841 20061025
>>>> 28 841 20061129
>>>> 29 841 20070112
>>>> 30 841 20070514
>>>> 39 999 20050503
>>>> 40 1019 19870508
>>>> 41 1019 19880223
>>>> 42 1019 19880330
>>>> 43 1019 19880330
>>>> Regards
>>>> Petr
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: r-help-bounces at r-project.org
>>>>> [mailto:r-help-bounces at r-project.org] On Behalf Of PIKAL Petr
>>>>> Sent: Tuesday, October 23, 2012 1:49 PM
>>>>> To: Stuart Leask; r-help at r-project.org
>>>>> Subject: Re: [R] [r] How to pick colums from a ragged array?
>>>>>
>>>>> Hi
>>>>>
>>>>> I did not check your code and rather followed your explanation. BTW,
>>>>> thanks for test data.
>>>>>
>>>>> small change in data frame to make DATE as Date class
>>>>>
>>>>> datum<-as.Date(as.character(DATE), format="%Y%m%d")
>>>>> id.d <- data.frame(ID,datum )
>>>>>
>>>>> ordering by date
>>>>>
>>>>> id.d<-id.d[order(id.d$datum),]
>>>>>
>>>>>
>>>>> two functions to test if first two dates are the same or last two
>>>>> dates are the same
>>>>>
>>>>> testfirst <- function(x) x[1,2]==x[2,2]
>>>>> testlast <- function(x) x[length(x),2]==x[length(x)-1,2]
>>>>>
>>>>> change one last date in the data frame to be the same as previous
>>>>>
>>>>> id.d[35,2]<-id.d[36,2]
>>>>>
>>>>> and here are results
>>>>>
>>>>> sapply(split(id.d, id.d$ID), testlast)
>>>>> 58 167 323 547 794 814 841 910 999 1019
>>>>> FALSE FALSE FALSE NA NA FALSE FALSE TRUE NA FALSE
>>>>>
>>>>>> sapply(split(id.d, id.d$ID), testfirst)
>>>>> 58 167 323 547 794 814 841 910 999 1019
>>>>> FALSE FALSE FALSE NA NA FALSE FALSE FALSE NA FALSE
>>>>>
>>>>> Now you can select ID which is true and remove it from your data
>>>>>
>>>>> which(sapply(split(id.d, id.d$ID), testlast))
>>>>>
>>>>> and use it for your data frame to subset/remove
>>>>>
>>>>> id.d$ID == as.numeric(names(which(sapply(split(id.d, id.d$ID), testlast))))
>>>>>  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>>>>> [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
>>>>> [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
>>>>> [37]  TRUE  TRUE  TRUE  TRUE
>>>>>
>>>>> However I am not sure if this is exactly what you want.
>>>>>
>>>>> Regards
>>>>> Petr
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: r-help-bounces at r-project.org
>>>>>> [mailto:r-help-bounces at r-project.org] On Behalf Of Stuart Leask
>>>>>> Sent: Tuesday, October 23, 2012 11:38 AM
>>>>>> To: r-help at r-project.org
>>>>>> Subject: [R] [r] How to pick colums from a ragged array?
>>>>>>
>>>>>> I have a large dataset (~1 million rows) of three variables: ID
>>>>>> (patient's name), DATE (of appointment) and DIAGNOSIS (given on that
>>>>>> date).
>>>>>> Patients may have been assigned more than one diagnosis at any one
>>>>>> appointment - leading to two rows, same ID and DATE but
>>>>>> different DIAGNOSIS.
>>>>>> The diagnoses may change between appointments.
>>>>>>
>>>>>> I want to subset the data in two ways:
>>>>>>
>>>>>> - define groups of patients by the first diagnosis given
>>>>>>
>>>>>> - define groups of patients by the last diagnosis given.
>>>>>>
>>>>>> The problem:
>>>>>> Unfortunately, a small number of patients have been given more
>>>>>> than one diagnosis at their first (or last) appointment. These
>>>>>> individuals I need to identify and remove, as it's not possible to
>>>>>> say uniquely what their first (or last) diagnosis was. So I need
>>>>>> to identify and remove these individuals which have pairs of
>>>>>> rows with the same ID and (lowest or highest) DATE. The size of
>>>>>> the dataset precludes the option of doing this by eye.
>>>>>>
>>>>>> I suspect there is a very elegant way of doing this in R.
>>>>>>
>>>>>> This is what I've come up with:
>>>>>>
>>>>>>
>>>>>> - Sort by DATE then ID
>>>>>>
>>>>>> - Make a ragged array of DATE by ID
>>>>>>
>>>>>> - Remove IDs that only occur once.
>>>>>>
>>>>>> - Subtract the first and second DATEs. Remove IDs for which
>>>>>> this = zero, as this will only be true for IDs for which the
>>>>>> appointment is recorded twice (because there were two diagnoses
>>>>>> recorded on this date).
>>>>>>
>>>>>> - (Then do the same to get the 'last appointment' duplicates,
>>>>>> by reversing the initial sort by DATE.)
>>>>>>
>>>>>> I am stuck at the 'Subtract dates' step: I would like to get the
>>>>>> data out of the ragged array by columns (so e.g. I end up with a
>>>>>> matrix of ID, 1st DATE, 2nd DATE). But I can't get the dates out
>>>>>> by column from the ragged array.
>>>>>>
>>>>>> I hope someone can help. My ugly code is below, with some data for
>>>>>> testing.
>>>>>>
>>>>>>
>>>>>> Stuart
>>>>>>
>>>>>>
>>>>>> Dr Stuart John Leask DM FRCPsych MB BChir MA
>>>>>> Clinical Senior Lecturer and Honorary Consultant Psychiatrist
>>>>>> Institute of Mental Health, Innovation Park
>>>>>> Triumph Road, Nottingham, Notts. NG7 2TU. UK
>>>>>> Tel. +44 115 82 30419
>>>>>> stuart.leask at nottingham.ac.uk
>>>>>> Google 'Dr Stuart Leask'
>>>>>>
>>>>>>
>>>>>> ID <- c(58,58,58,58,167,167,323,323,323,323,323,323,323
>>>>>> ,547,794,814,814,814,814,814,814,841,841,841,841,841
>>>>>> ,841,841,841,841,910,910,910,910,910,910,999,1019,1019
>>>>>> ,1019)
>>>>>>
>>>>>> DATE <-
>>>>>> c(20060821,20061207,20080102,20090904,20040205,20040323,20051111
>>>>>> ,20060111,20071119,20080107,20080407,20080521,20080711,20041005
>>>>>> ,20070905,20020814,20021125,20040429,20040429,20071205,20080227
>>>>>> ,20050421,20060130,20060428,20060602,20060816,20061025,20061129
>>>>>> ,20070112,20070514,20091105,20091117,20091119,20091120,20091210
>>>>>> ,20091224,20050503,19870508,19880223,19880330)
>>>>>>
>>>>>> id.d <- cbind (ID,DATE )
>>>>>> # create ragged array, 1-n DATES for every NAME
>>>>>> rag.a <- split ( id.d [ ,2 ], id.d [ ,1])
>>>>>>
>>>>>> # Inelegant attempt to remove IDs that only have one entry:
>>>>>>
>>>>>> rag.s <-tapply (id.d [ ,2], id.d [ ,1], sum) # add up the dates per row
>>>>>> # Since DATE is in 'year mo da', if there's only one date, sum will be
>>>>>> # less than 21000000:
>>>>>> rag.t <- rag.s [ rag.s > 21000000 ]
>>>>>> multi.dates <- rownames ( rag.t ) # all the IDs with >1 date
>>>>>> rag.am <- rag.a [ multi.dates ] # rag.am only has IDs with > 1 Date
>>>>>>
>>>>>>
>>>>>> # But now I'm stuck.
>>>>>> # Each row of the array is rag.am$ID.
>>>>>> # So I can't pick columns of DATEs from the ragged array.
>>>>>>
>>>>>>
>>>>>> ______________________________________________
>>>>>> R-help at r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>> PLEASE do read the posting guide
>>>>>> http://www.R-project.org/posting-guide.html
>>>>>> and provide commented, minimal, self-contained, reproducible code.