[R] Finding overlaps in vector
David Winsemius
dwinsemius at comcast.net
Sat Dec 22 15:04:05 CET 2007
Johannes Graumann <johannes_graumann at web.de> wrote in
news:fkinut$re4$1 at ger.gmane.org:
> But cutree does away with the indexes from the original input, which
> rect.hclust retains.
> I will have no other choice but to match that input with the 'values'
> contained in the clusters ...
If you want to retain the original rownames, then try:
> vector
[1] 0.00 0.45 1.00 2.00 3.00 3.25 3.33 3.75 4.10 5.00 6.00 6.45 7.00 7.10 8.00
#-----start cut-and-pastable-----
# This will "label" individual group membership.
# diff(.) returns a vector that is one element shorter than its input,
# so the result is padded at the front: c(1, 1 + cumsum(diff(.) > 0.5))
grp.v <- cbind(vector, c(1, 1 + cumsum(as.numeric(diff(vector) > 0.5))))
#You can then tally up the counts in groups
tb<-table(grp.v[,2])
tb
#1 2 3 4 5 6 7 8
#2 1 1 5 1 2 2 1
# And apply the counts to the rows by doing a
# "row count" lookup into tb[.]
grp.v<-cbind(grp.v,tb[grp.v[,2]])
grp.v
#-----end cut-and-pastable-----
  vector
1   0.00 1 2
1   0.45 1 2
2   1.00 2 1
3   2.00 3 1
4   3.00 4 5
4   3.25 4 5
4   3.33 4 5
4   3.75 4 5
4   4.10 4 5
5   5.00 5 1
6   6.00 6 2
6   6.45 6 2
7   7.00 7 2
7   7.10 7 2
8   8.00 8 1
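If the positions in the original input are what is needed (the kind of
index list that rect.hclust returns), one option is to split the index
vector by the group label. This is only a sketch: the names v, grp and
idx are illustrative and not from the thread, and the input is assumed
to be sorted in increasing order.

```r
v <- c(0, 0.45, 1, 2, 3, 3.25, 3.33, 3.75, 4.1, 5, 6, 6.45, 7, 7.1, 8)

# Start a new group whenever the gap to the previous value exceeds 0.5
# (assumes v is sorted in increasing order)
grp <- c(1, 1 + cumsum(diff(v) > 0.5))

# Positions in v, collected per group label
idx <- split(seq_along(v), grp)

# Keep only the groups that have more than one member
multi <- idx[lengths(idx) > 1]
multi
```

Each element of multi holds the positions in v of one cluster, so
v[multi[["4"]]] gives back the values of the group labelled 4.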
Further processing of the membership "label" is probably easier after
converting the matrix to a data frame and treating the label as a
factor. If you only want the rownames and values of vector that belong
to groups with more than <x> members, that should then be
straightforward.
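A rough sketch of that data-frame route, assuming the same data and the
0.5 threshold used above. The names v, df, n, x and big are made up for
illustration, and x = 1 is just an example cutoff.

```r
v <- c(0, 0.45, 1, 2, 3, 3.25, 3.33, 3.75, 4.1, 5, 6, 6.45, 7, 7.1, 8)

# Group label as before: a new group starts where the gap exceeds 0.5
grp <- c(1, 1 + cumsum(diff(v) > 0.5))

# Data frame with the label stored as a factor
df <- data.frame(value = v, group = factor(grp))

# Size of the group each row belongs to
df$n <- ave(df$value, df$group, FUN = length)

# Hypothetical cutoff: keep rows in groups with more than x members
x <- 1
big <- droplevels(df[df$n > x, ])
big
```

The row names of big are the row numbers of the original input, so the
link back to the positions in v is kept after filtering.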
--
David Winsemius
> Gabor Grothendieck wrote:
>
>> If we don't need any plotting we don't really need rect.hclust at
>> all. Split the output of cutree, instead. Continuing from the
>> prior code:
>>
>>> for(el in split(unname(vv), names(vv))) print(el)
>> [1] 0.00 0.45
>> [1] 1
>> [1] 2
>> [1] 3.00 3.25 3.33 3.75 4.10
>> [1] 5
>> [1] 6.00 6.45
>> [1] 7.0 7.1
>> [1] 8
>>
>> On Dec 21, 2007 3:24 PM, Johannes Graumann <johannes_graumann at web.de>
>> wrote:
>>> Hm, hm, rect.hclust doesn't accept "plot=FALSE" and cutree doesn't
>>> retain the indexes of membership ... any way short of ripping out the
>>> guts of rect.hclust to achieve the same result without an active
>>> graphics device?
>>>
>>> Joh
>>>
>>>
>>> >> # cluster and plot
>>> >> hc <- hclust(dist(v), method = "single")
>>> >> plot(hc, lab = v)
>>> >> cl <- rect.hclust(hc, h = .5, border = "red")
>>> >>
>>> >> # each component of list cl is one cluster. Print them out.
>>> >> for(idx in cl) print(unname(v[idx]))
>>> > [1] 8
>>> > [1] 7.0 7.1
>>> > [1] 6.00 6.45
>>> > [1] 5
>>> > [1] 3.00 3.25 3.33 3.75 4.10
>>> > [1] 2
>>> > [1] 1
>>> > [1] 0.00 0.45
>>> >
>>> >> # a different representation of the clusters
>>> >> vv <- v
>>> >> names(vv) <- ct <- cutree(hc, h = .5)
>>> >> vv
>>> >    1    1    2    3    4    4    4    4    4    5    6    6    7    7    8
>>> > 0.00 0.45 1.00 2.00 3.00 3.25 3.33 3.75 4.10 5.00 6.00 6.45 7.00 7.10 8.00
>>> >
>>> >
>>> > On Dec 21, 2007 4:56 AM, Johannes Graumann
>>> > <johannes_graumann at web.de> wrote:
>>> >> <posted & mailed>
>>> >>
>>> >> Dear all,
>>> >>
>>> >> I'm trying to solve the problem of how to find clusters of
>>> >> values in a vector that are closer than a given value.
>>> >> Illustrated this might look as follows:
>>> >>
>>> >> vector <- c(0,0.45,1,2,3,3.25,3.33,3.75,4.1,5,6,6.45,7,7.1,8)
>>> >>
>>> >> When using '0.5' as the proximity requirement, the following
>>> >> groups would result:
>>> >> 0,0.45
>>> >> 3,3.25,3.33,3.75,4.1
>>> >> 6,6.45
>>> >> 7,7.1
>>> >>
>>> >> Jim Holtman proposed a very elegant solution in
>>> >> http://tolstoy.newcastle.edu.au/R/e2/help/07/07/21286.html, which
>>> >> I have modified and used since he wrote it to me. The beauty
>>> >> of this approach is that it will not only work for constant
>>> >> proximity requirements as above, but also for overlap-windows
>>> >> defined in terms of ppm around each value. Now I have an
>>> >> additional need and have found no way (short of iteratively stepping
>>> >> through all the groups returned) to figure out how to do that
>>> >> with Jim's approach: how to figure out that 6,6.45 and 7,7.1 are
>>> >> separate clusters?
>>> >>
>>> >> Thanks for any hints, Joh
>>> >>