[R] R - Populate Another Variable Based on Multiple Conditions | For a Large Dataset

Sun Jul 3 14:16:11 CEST 2016

The data set did not show up. The R-help list tends to strip out most file types as a safety precaution.  Try renaming the file from xxx.csv to xxx.txt and it should come through alright.
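
Another option that avoids attachments altogether is to paste the output of dput() into the body of the message. A minimal sketch follows; the small data frame is a made-up stand-in for the real data, and only the column names ID, date and drug_admin come from the quoted script.

```r
# Made-up stand-in for the real data frame (values invented for illustration).
drug_study <- data.frame(
  ID = c("J1/3", "R1/3", "R10/1"),
  date = as.Date(c("2007-05-02", "2007-05-07", "2007-05-09")),
  drug_admin = c("N", "Y", "Y"),
  stringsAsFactors = FALSE
)

# dput() prints R code that rebuilds the object; paste that text into the
# body of the email. head() keeps the sample small.
dput(head(drug_study, 20))

# A reader can rebuild the data by evaluating the pasted text.
pasted  <- paste(capture.output(dput(drug_study)), collapse = "\n")
rebuilt <- eval(parse(text = pasted))
```

Because the result is plain text, it is not affected by the list's attachment filter.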

John Kane
Kingston ON Canada

> -----Original Message-----
> From: kwamae at kemri-wellcome.org
> Sent: Sun, 3 Jul 2016 09:39:59 +0000
> To: jdnewmil at dcn.davis.ca.us, r-help at r-project.org
> Subject: Re: [R] R - Populate Another Variable Based on Multiple
> Conditions | For a Large Dataset
> 
> Hi Jeff, pardon me, I was surely not making it easy. I hope this time I
> will ☺
> 
> Attached is a snippet of the dataset in csv format and below is the
> R script I have managed so far.
> 
> -----------------------------------------------------------------------------------------------------------------------------------------------
> -----------------------------------------------------------------------------------------------------------------------------------------------
> 
> drug_study <- read.csv("drug_study.csv", header = TRUE)
> head(drug_study)
> drug_study$date <- as.Date(drug_study$date, "%m/%d/%Y")
> drug_study$study_id <- ""  # create new column
> 
> individual <- unique(drug_study$ID)  # vector of individuals
> datalength <- dim(drug_study)[1]     # number of rows in dataframe
> 
> for (i in 1:length(individual)) {
>   for (j in 1:datalength) {
>     # capture date of start
>     start_admin <- drug_study[drug_study$ID == individual[i] &
>                               drug_study$year == 2007 &
>                               drug_study$drug_admin == "Y" &
>                               drug_study$month == 5, 2]
>     # capture date of end
>     end_admin <- drug_study[drug_study$ID == individual[i] &
>                             drug_study$year == 2008 &
>                             drug_study$drug_admin == "Y" &
>                             drug_study$month == 2, 2]
> 
>     # populate respective row if condition is met
>     if (drug_study[j, 1] == individual[i] & drug_study[j, 2] >= start_admin &
>         drug_study[j, 2] < end_admin) {
>       drug_study[j, 6] <- paste(start_admin)
>     }
>   }
> }
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> For this dataset, there are three individuals: J1/3, R1/3, R10/1.
> 
> The script works for the last two individuals but not J1/3 with the error
> below:
> 
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> Error in if (drug_study[j, 1] == individual[i] & drug_study[j, 2] >=
> start_admin &  :
>   argument is of length zero
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> 
> I figured it’s because this individual’s start_admin and end_admin dates
> aren’t captured, which makes the if statement fail. That’s my first
> problem: there are thousands of individuals with varying
> start_admin and end_admin dates and I need a script to capture these for
> every individual.
> 
> Secondly, the above script is taking almost an hour to run for the entire
> dataset, just for the individuals whose start_admin and end_admin dates
> can be captured by the if statement.
> 
> I need help in coming up with a script that will tackle the problem,
> taking into account the different start_admin and end_admin dates, and
> that is efficient with regard to time.
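
One way to avoid both the length-zero error and the nested loops is to look up each individual's start and end date once, then compare whole columns at a time. The following is a minimal sketch, not a tested solution for the real data. The toy data frame is invented (the attachment never arrived), first_date_per_row() is a hypothetical helper, and the column names ID, date, year, month, drug_admin and study_id are taken from the quoted script. The fixed May 2007 and February 2008 filters are also copied from that script and would need to be generalised if the months vary between individuals.

```r
# Toy data standing in for the real file; values are invented for illustration.
drug_study <- data.frame(
  ID = rep(c("J1/3", "R1/3"), each = 4),
  date = as.Date(c("2007-04-10", "2007-05-02", "2007-11-20", "2008-02-15",
                   "2007-05-07", "2007-09-01", "2008-02-11", "2008-03-01")),
  drug_admin = c("N", "N", "N", "Y", "Y", "N", "Y", "N"),
  stringsAsFactors = FALSE
)
drug_study$year  <- as.integer(format(drug_study$date, "%Y"))
drug_study$month <- as.integer(format(drug_study$date, "%m"))

# Hypothetical helper: earliest flagged date per ID, aligned to the rows of
# `data` (NA for individuals that have no flagged row).
first_date_per_row <- function(data, flagged) {
  hits <- data[which(flagged), c("ID", "date")]
  hits <- hits[order(hits$date), ]
  hits <- hits[!duplicated(hits$ID), ]
  hits$date[match(data$ID, hits$ID)]
}

# Filters copied from the quoted script (assumed, not confirmed for all IDs).
is_start <- with(drug_study, drug_admin == "Y" & year == 2007 & month == 5)
is_end   <- with(drug_study, drug_admin == "Y" & year == 2008 & month == 2)
start_date <- first_date_per_row(drug_study, is_start)
end_date   <- first_date_per_row(drug_study, is_end)

# Rows on or after the start and strictly before the end; NA dates never match.
in_window <- !is.na(start_date) & !is.na(end_date) &
  drug_study$date >= start_date & drug_study$date < end_date

drug_study$study_id <- ""
drug_study$study_id[in_window] <- format(start_date[in_window])
print(drug_study[, c("ID", "date", "drug_admin", "study_id")])
```

In the toy data J1/3 has no "Y" row in May 2007, so its start date is NA and its rows are left blank instead of stopping the run. Since match() and the comparisons work on whole columns, the run time should no longer grow with the number of individuals multiplied by the number of rows.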
> 
> Regards
> -------------------------------------------------------------------------------
> Kevin Kariuki
> 
> ###############################################################################################################################################
> ###############################################################################################################################################
> 
> On 7/3/16, 8:42 AM, "Jeff Newmiller" <jdnewmil at dcn.davis.ca.us> wrote:
> 
> You are making this hard on yourself by not paying attention to the Posting
> Guide listed in the footer of every email on this list. You would
> probably also find [1] helpful.
> 
> [1]
> http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
> --
> Sent from my phone. Please excuse my brevity.
> 
> On July 2, 2016 3:41:07 PM PDT, Kevin Wamae <KWamae at kemri-wellcome.org>
> wrote:
> >Hi Jeff, sorry for referring to you as Jennifer earlier, accept my
> >apologies.
> >
> >I attached a sample dataset in the question; I'm afraid it must have
> >failed to attach.
> >
> >I have attached it again.
> >
> >
> >Regards
> >-------------------------------------------------------------------------------
> >Kevin Kariuki
> >
> >
> >On 7/2/16, 7:37 PM, "Jeff Newmiller" <jdnewmil at dcn.davis.ca.us> wrote:
> >
> >I can understand you not wanting to supply your actual data online, but
> >only you know what your data looks like so only you can create a
> >simulated data set that we could show you how to work with.
> >--
> >Sent from my phone. Please excuse my brevity.
> >
> >On July 2, 2016 2:57:39 AM PDT, Kevin Wamae <KWamae at kemri-wellcome.org>
> >wrote:
> >>I have a drug-trial study dataset (attached image).
> >>
> >>Since it's a large and complex dataset (at least to me), I hope to be
> >>as clear as possible with my question.
> >>The dataset is from a study where individuals are given drugs and
> >>followed up over a period spanning two consecutive years. Individuals
> >>do not start treatment on the same day. The variable "drug-admin" is
> >>marked "x" on the date they start treatment and again on the date they
> >>stop treatment in the following year.
> >>There exists another variable, "study_id", that I hope to populate as
> >>can be seen in the dataset, with the following conditions:
> >>
> >>For every individual
> >>•    if the individual has entries that show they received drugs both
> >>on the start and end date (marked with the "x")
> >>•    if the start of drug administration falls in month == 2 | 3 and
> >>end of administration falls in month == 2 | 4
> >>•    then, using the date that marks the start of drug administration,
> >>populate the variable "study_id" in all the rows that fall within the
> >>timeframe that the individual was given drugs but excluding the end of
> >>drug administration.
> >>
> >>I have tried my level best and while I have explored several examples
> >>online, I haven't managed to solve this. The dataset contains close to
> >>6000 individuals spanning 10 years and my best bet was to use a loop
> >>which keeps crashing R after running for close to 30 minutes. I have also
> >>read that dplyr may do the job but my attempts have been in vain.
> >>
> >>sample code
> >>-------------------------------------------------------------------------------------------------------------------------------------------------------------------
> >>individual <- unique (df$ID)  #vector of individuals
> >>datalength <- dim(df)[1]      #number of rows in dataframe
> >>
> >>for (i in 1:length(individual)) {
> >>  for (j in 1:datalength) {
> >>    #capture date of start
> >>    start_admin <- df[(df$year == 2007] & df$drug_admin == "x" &
> >>      c(df$month == 2 | df$month == 3),1]
> >>    #capture date of end
> >>    end_admin <- df[(df$year == 2008] & df$drug_admin == "x" &
> >>      c(df$month == 2 | df$month == 4),1]
> >>
> >>    #populate respective row if condition is met
> >>    if(df[datalength,1] == individual(i) & df[datalength,2] >= start_admin
> >>       & df[datalength,2] < end_admin) {
> >>      df[datalength,6] <- start_admin
> >>    }
> >>  }
> >>}
>>> 
> >>-------------------------------------------------------------------------------------------------------------------------------------------------------------------
> >>
> >>Above is the code that keeps failing.
> >>
> >>Any help is highly appreciated.
> >>
> >>
> >>______________________________________________________________________
>>> 
> >>This e-mail contains information which is confidential. It is intended
> >>only for the use of the named recipient. If you have received this
> >>e-mail in error, please let us know by replying to the sender, and
> >>immediately delete it from your system.  Please note, that in these
> >>circumstances, the use, disclosure, distribution or copying of this
> >>information is strictly prohibited. KEMRI-Wellcome Trust Programme
> >>cannot accept any responsibility for the  accuracy or completeness of
> >>this message as it has been transmitted over a public network.
> >Although
> >>the Programme has taken reasonable precautions to ensure no viruses
> >are
> >>present in emails, it cannot accept responsibility for any loss or
> >>damage arising from the use of the email or attachments. Any views
> >>expressed in this message are those of the individual sender, except
> >>where the sender specifically states them to be the views of
> >>KEMRI-Wellcome Trust Programme.
> >>______________________________________________________________________
>>> 
>>> 
> >>------------------------------------------------------------------------
>>> 
> >>______________________________________________
> >>R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >>https://stat.ethz.ch/mailman/listinfo/r-help
> >>PLEASE do read the posting guide
> >>http://www.R-project.org/posting-guide.html
> >>and provide commented, minimal, self-contained, reproducible code.
