[R] Problem in selecting rows in R
Kristi Glover
kristi.glover at hotmail.com
Sun Mar 20 18:06:41 CET 2016
Dear Bert, Boris and Michael,
Thanks for your help. I noticed the example data was not compatible with my specification. The messy example embarrassed me. I tested the codes you sent me both codes worked perfectly fine. Please accept my sincere thanks.
KG
________________________________________
From: Boris Steipe <boris.steipe at utoronto.ca>
Sent: March 20, 2016 9:21 AM
To: Kristi Glover
Cc: R-help
Subject: Re: [R] Problem in selecting rows in R
Don't post in HTML, but thanks for providing dput() outout.
This code tries to follow your specifications, assuming by "time period" you mean days. The result doesn't look like your desired example though, because that is not compatible with your specs and I can't guess what you want to do differently. E.g your species 2 is observed on three different days:
6 6 Sp2 SiteB 2015-01-17 Jan
7 7 Sp2 SiteD 2015-01-29 Feb
8 8 Sp2 SiteC 2015-07-03 March
... yet your dat2 contains only the first observation in Jan.
Also note that there are conflicts like this:
3 3 Sp1 SiteA 2015-02-27
4 4 Sp1 SiteB 2015-02-27
... that you have not specified how to resolve. Is there an ordering
implied such that SiteB is the "last site"?
Finally, your data is messy: youhave Sp2 and SP3, SiteC and SIteD - if you don't pay better attention that's going to bite you one day.
Here's working code, you should be able to solve it entirely based on
this:
# Step 1: most of your columns are factors. This doesn't matter much
# except for Time. For time this is the Wrong Way, since time
# is ordered. We replace ...
datNew <- dat1
datNew$Time <- strptime(as.character(dat1$Time), format="%m/%d/%y")
# Step 2: Your problem is solved with a combination of order() and duplicated().
# In order to make the ordering of species/site unique we generate
# an auxiliary column.
datNew$key <- paste(datNew$Species, datNew$Time, sep=".")
# Step 3: Now we order the rows by $key and $Observed.site, in decreasing order
datNew <- datNew[ order(datNew$key, datNew$Observed.site, decreasing=TRUE), ]
# Step 4: Then we drop all but the first row from each species/site
# block. The first row is the last date (decreasing sort, remember!)
datNew <- datNew[!duplicated(datNew$key), ]
# Step 5: We can drop the keys and order ascending
# to make it look more like your example. Now we have
# one site on each day for each individual.
datNew <- datNew[order(datNew$key), colnames(datNew) != "key"]
datNew
B.
On Mar 20, 2016, at 11:14 AM, Michael Dewey <lists at dewey.myzen.co.uk> wrote:
> Dear Kristi
>
> You do not say what you have tried but I would have thought the key to a solution was to split your problem up by site and write a function to select the last observation within each site. I am not quite clear whether this is the last in time or the last occurring row so I just give a hint below as to functions
>
> ?aggregate
> ?by
> ?strplit
> ?lapply
>
> which may all be relevant here.
>
> On 20/03/2016 14:15, Kristi Glover wrote:
>> Hi R Users,
>> Some individuals recorded multiple within a time period. But, I want to select the row of last site within each time period for each individual. I spent a substantial time, but no luck in selecting the rows. Would you give me a hint for this one? I have a very large data set, but this is just an example.
>> Thanks for your help.
>>
>> I want to get dat2 from dat1.
>>
>> dat1<-structure(list(sn = 1:16, Species = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 4L, 5L, 5L, 5L, 5L, 5L), .Label = c("Sp1",
>> "Sp2", "SP3", "Sp4", "Sp5"), class = "factor"), Observed.site = structure(c(1L, 3L, 1L, 2L, 1L, 2L, 4L, 3L, 1L, 2L, 2L, 3L, 5L, 4L, 1L, 3L), .Label = c("SiteA",
>> "SiteB", "SiteC", "SiteD", "SIteD"), class = "factor"), Time = structure(c(1L, 5L, 6L, 6L, 6L, 2L, 3L, 11L, 8L, 10L, 1L, 1L, 4L, 4L, 7L, 9L), .Label = c("1/15/15",
>> "1/17/15", "1/29/15", "2/17/15", "2/25/15", "2/27/15", "2/28/15", "3/27/15", "3/5/15", "4/19/15", "7/3/15"), class = "factor"),
>> Month = structure(c(3L, 2L, 2L, 2L, 2L, 3L, 2L, 4L, 4L, 1L, 3L, 3L, 2L, 2L, 2L, 4L), .Label = c("April", "Feb", "Jan",
>> "March"), class = "factor")), .Names = c("sn", "Species", "Observed.site", "Time", "Month"), class = "data.frame", row.names = c(NA, -16L))
>> dat1
>> #---
>> dat2<-structure(list(sn = c(1L, 5L, 6L, 9L, 10L, 11L, 12L, 15L, 16L), Species = structure(c(1L, 1L, 2L, 3L, 3L, 4L, 5L, 5L, 5L), .Label = c("Sp1",
>> "Sp2", "SP3", "Sp4", "Sp5"), class = "factor"), Observed.site = structure(c(1L, 1L, 2L, 1L, 2L, 2L, 3L, 1L, 3L), .Label = c("SiteA", "SiteB",
>> "SiteC"), class = "factor"), Time = structure(c(1L, 3L, 2L, 5L, 7L, 1L, 1L, 4L, 6L), .Label = c("1/15/15", "1/17/15", "2/27/15",
>> "2/28/15", "3/27/15", "3/5/15", "4/19/15"), class = "factor"), Month = structure(c(3L, 2L, 3L, 4L, 1L, 3L, 3L, 2L, 4L), .Label = c("April",
>> "Feb", "Jan", "March"), class = "factor")), .Names = c("sn", "Species", "Observed.site", "Time", "Month"), class = "data.frame", row.names = c(NA, -9L))
>> dat2
>>
>> [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
> --
> Michael
> http://www.dewey.myzen.co.uk/home.html
>
> ______________________________________________
> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
More information about the R-help
mailing list