[R] Reducing execution time
sri vathsan
srivibish at gmail.com
Wed Jul 27 20:27:42 CEST 2016
Hi,
It is not just 79 triplets. As I said, there are 79 codes. I am making
triplets out of those 79 codes and matching the triplets against the list.
Please find the dput of the data below.
> dput(head(newd,10))
structure(list(uniq_id = c("1", "2", "3", "4", "5", "6", "7",
"8", "9", "10"), hi = c("11, 22, 84, 85, 108, 111",
"18, 84, 85, 87, 122, 134",
"2, 18, 22", "18, 108, 122, 134, 176", "19, 85, 87, 100, 107",
"79, 85, 111", "11, 88, 108", "19, 88, 96", "19, 85, 96",
"19, 100, 103")), .Names = c("uniq_id", "hi"), row.names = c(NA,
-10L), class = c("tbl_df", "tbl", "data.frame"))
>
I am trying to count the frequency of the triplets in the above data using
the below code.
# split column into a list; split on the comma plus any following space,
# so that "18" and " 18" are not treated as different codes
myList <- strsplit(newd$hi, split = ",\\s*")
# get all triplet combinations of the unique codes
myCombos <- t(combn(unique(unlist(myList)), 3))
# count the records in which all three codes of a triplet are present
myCounts <- sapply(seq_len(nrow(myCombos)), FUN = function(i) {
  sum(sapply(myList, function(j) {
    sum(!is.na(match(myCombos[i, ], j)))
  }) == 3)
})
# final matrix: the three codes plus their count
final <- cbind(matrix(as.integer(myCombos), nrow(myCombos)), myCounts)
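One possible way to cut the run time, sketched here only as an assumption about what would help: instead of testing every possible triplet against every record, enumerate the triplets that actually occur inside each record and tabulate them. Triplets that never occur are then simply absent from the table (count 0). The helper names tripletKeys and countTriplet are hypothetical, and the data is just the 10 rows from the dput() output above, rebuilt as a plain data frame.

```r
# First 10 records, copied from the dput() output (plain data.frame for simplicity)
newd <- data.frame(
  uniq_id = as.character(1:10),
  hi = c("11, 22, 84, 85, 108, 111", "18, 84, 85, 87, 122, 134",
         "2, 18, 22", "18, 108, 122, 134, 176", "19, 85, 87, 100, 107",
         "79, 85, 111", "11, 88, 108", "19, 88, 96", "19, 85, 96",
         "19, 100, 103"),
  stringsAsFactors = FALSE
)

# Split each record into a sorted vector of unique integer codes
myList <- lapply(strsplit(newd$hi, split = ",\\s*"),
                 function(x) sort(unique(as.integer(x))))

# Hypothetical helper: one "a-b-c" key per triplet contained in a record.
# Records with fewer than 3 codes contribute nothing.
tripletKeys <- function(codes) {
  if (length(codes) < 3) return(character(0))
  combn(codes, 3, FUN = paste, collapse = "-")
}

# Tabulate the keys over all records: one count per triplet that occurs
tripletCounts <- table(unlist(lapply(myList, tripletKeys)))

# Hypothetical helper: look up one triplet, in any order (keys are sorted)
countTriplet <- function(tr) {
  key <- paste(sort(as.integer(tr)), collapse = "-")
  if (key %in% names(tripletCounts)) as.integer(tripletCounts[key]) else 0L
}

countTriplet(c(134, 18, 122))
countTriplet(c(19, 85, 96))
```

The work here depends on the number of codes per record rather than on the total number of possible triplets, so it should scale better when records are short. I have not timed it on the full data.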
I hope I made my point clear. Please let me know if I missed anything.
Regards,
Sri
On Wed, Jul 27, 2016 at 11:19 PM, Sarah Goslee <sarah.goslee at gmail.com>
wrote:
> You said you had 79 triplets and 8000 records.
>
> When I compared 100 triplets to 10000 records it took 86 seconds.
>
> So obviously there is something you're not telling us about the format
> of your data.
>
> If you use dput() to provide actual examples, you will get better
> results than if we on Rhelp have to guess. Because we tend to guess in
> ways that make the most sense after extensive R experience, and that's
> probably not what you have.
>
> Sarah
>
> On Wed, Jul 27, 2016 at 1:29 PM, sri vathsan <srivibish at gmail.com> wrote:
> > Hi,
> >
> > Thanks for the solution. But I am afraid that even with this code it
> > still takes a long time. It has been an hour and it is still executing. I
> > understand the delay, because each triplet has to be compared against
> > almost 9000 elements.
> >
> > Regards,
> > Sri
> >
> > On Wed, Jul 27, 2016 at 9:02 PM, Sarah Goslee <sarah.goslee at gmail.com>
> > wrote:
> >>
> >> Hi,
> >>
> >> It's really a good idea to use dput() or some other reproducible way
> >> to provide data. I had to guess as to what your data looked like.
> >>
> >> It appears that order doesn't matter?
> >>
> >> Given that, here's one approach:
> >>
> >> combs <- structure(list(V1 = c(65L, 77L, 55L, 23L, 34L), V2 = c(23L, 34L,
> >> 34L, 77L, 65L), V3 = c(77L, 65L, 23L, 34L, 55L)), .Names = c("V1",
> >> "V2", "V3"), class = "data.frame", row.names = c(NA, -5L))
> >>
> >> dat <- list(
> >> c(77,65,34,23,55),
> >> c(65,23,77,65,55,34),
> >> c(77,34,65),
> >> c(55,78,56),
> >> c(98,23,77,65,34))
> >>
> >>
> >> sapply(seq_len(nrow(combs)), function(i)sum(sapply(dat,
> >> function(j)all(combs[i,] %in% j))))
> >>
> >> On a dataset of comparable size to yours, it takes me under a minute
> >> and a half.
> >>
> >> > combs <- combs[rep(1:nrow(combs), length=100), ]
> >> > dat <- dat[rep(1:length(dat), length=10000)]
> >> >
> >> > dim(combs)
> >> [1] 100 3
> >> > length(dat)
> >> [1] 10000
> >> >
> >> > system.time(test <- sapply(seq_len(nrow(combs)),
> >> > function(i)sum(sapply(dat, function(j)all(combs[i,] %in% j)))))
> >> user system elapsed
> >> 86.380 0.006 86.391
> >>
> >>
> >>
> >>
> >> On Wed, Jul 27, 2016 at 10:47 AM, sri vathsan <srivibish at gmail.com> wrote:
> >> > Hi,
> >> >
> >> > Apologies for the lack of information.
> >> >
> >> > Basically, myCombos is a matrix with 3 columns; each row is a triplet
> >> > formed from the 79 codes. There are around 3 lakh (300,000) such
> >> > combinations, and it looks like this:
> >> >
> >> > V1 V2 V3
> >> > 65 23 77
> >> > 77 34 65
> >> > 55 34 23
> >> > 23 77 34
> >> > 34 65 55
> >> >
> >> > Each triplet is compared against a list (myList) of 8177 elements,
> >> > which looks like this:
> >> >
> >> > 77,65,34,23,55
> >> > 65,23,77,65,55,34
> >> > 77,34,65
> >> > 55,78,56
> >> > 98,23,77,65,34
> >> >
> >> > Now I want to count the number of occurrences of each triplet in the
> >> > above list. For example, the triplet 65 23 77 is seen 3 times in the
> >> > list. So my output looks like this:
> >> >
> >> > V1 V2 V3 Freq
> >> > 65 23 77 3
> >> > 77 34 65 4
> >> > 55 34 23 2
> >> >
> >> > I hope I made it clear this time.
> >> >
> >> >
> >> > On Wed, Jul 27, 2016 at 7:00 PM, Bert Gunter <bgunter.4567 at gmail.com>
> >> > wrote:
> >> >
> >> >> Not entirely sure I understand, but match() is already vectorized, so
> >> >> you should be able to lose the sapply(). This would speed things up a
> >> >> lot. Please re-read ?match *carefully*.
> >> >>
> >> >> Bert
> >> >>
> >> >> On Jul 27, 2016 6:15 AM, "sri vathsan" <srivibish at gmail.com> wrote:
> >> >>
> >> >> Hi,
> >> >>
> >> >> I created a list of 3-number combinations (myCombos, around 3 lakh
> >> >> combinations) and am counting the occurrences of those combinations in
> >> >> another list. This comparison list (myList) has around 8000 records. I
> >> >> am using the following code.
> >> >>
> >> >> myCounts <- sapply(1:nrow(myCombos), FUN=function(i) {
> >> >> sum(sapply(myList, function(j) {
> >> >> sum(!is.na(match(c(myCombos[i,]), j)))})==3)})
> >> >>
> >> >> The above code takes a very long time to execute. Is there a more
> >> >> efficient method that would reduce the time?
> >> >> --
> >> >>
> >> >> Regards,
> >> >> Srivathsan.K
> >> >>
> >
> >
> >
> >
>
--
Regards,
Srivathsan.K
Phone : 9600165206