[R] how to convert a data.frame to tree structure object such as dendrogram
Bert Gunter
gunter.berton at gene.com
Wed Mar 13 21:12:49 CET 2013
Here is a simpler, less clumsy version of my previous recursive R
solution that I sent you privately, which I'll also cc to the list
this time. It's now almost a one-liner.
To avoid problems with unused factor levels, I still prefer to have
character vectors, not factors, as the data frame columns, so:
df <- data.frame(a = c('A','A','A','B','B','C','C','C'),
                 b = c('Aa','Ab','Ab','Ba','Bd','C1','C2','C3'),
                 c = c('Aa1','Ab1','Ab2','Ba1','Bd2','C11','C12','C13'),
                 stringsAsFactors = FALSE)
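For anyone wondering what the unused-level problem looks like, here is a minimal sketch. The names f, sub, by_factor, by_char and by_drop are made up for illustration and are not part of the solution itself. Subsetting a factor keeps all of its levels, so split() returns an empty component for every level that no longer occurs, while a character vector only produces groups for the values actually present.

```r
## Hypothetical illustration; none of these names appear in the solution.
f   <- factor(c("A", "A", "B", "C"))
sub <- f[1:2]                  # values are all "A", but levels are still A, B, C

by_factor <- split(1:2, sub)                # one component per level
by_char   <- split(1:2, as.character(sub))  # one component per value present
by_drop   <- split(1:2, droplevels(sub))    # unused levels removed first

str(by_factor)   # components B and C are empty
str(by_char)     # only component A
```

In a recursive split those empty components would turn into empty branches at every level of the tree. Calling split() with drop = TRUE is another way to avoid them. Since R 4.0.0 data.frame() defaults to stringsAsFactors = FALSE, so the explicit argument mainly matters on older versions.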
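## x: row indices of df in the current branch; i: current column; n: last column.
## Note that df is not an argument; it is looked up in the enclosing (global) environment.
makeTree2 <- function(x, i, n)
{
   if (i == n) df[x, i]           ## leaf: values of the last column
   else {
      spl <- split(x, df[x, i])   ## group the row indices by column i
      ## Can't use Recall(): here it would re-call the anonymous function
      lapply(spl, function(x) makeTree2(x, i + 1, n))
   }
}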
This is now called as
> makeTree2(seq_len(nrow(df)),1,ncol(df)) ## no list structure needed for x
## yielding (with the root implicit now)
$A
$A$Aa
[1] "Aa1"
$A$Ab
[1] "Ab1" "Ab2"
$B
$B$Ba
[1] "Ba1"
$B$Bd
[1] "Bd2"
$C
$C$C1
[1] "C11"
$C$C2
[1] "C12"
$C$C3
[1] "C13"
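Since makeTree2 reads df from the global environment, a variant that takes the data frame as an argument may be easier to reuse. The following is only a sketch under that assumption: the name makeTree3 and its arguments dat, rows and i are hypothetical. It also converts each column with as.character(), so factor columns should not produce empty branches.

```r
## Sketch only: makeTree3 and its argument names are hypothetical.
makeTree3 <- function(dat, rows = seq_len(nrow(dat)), i = 1L) {
  ## as.character() makes factor columns behave like character columns
  vals <- as.character(dat[[i]][rows])
  if (i == ncol(dat)) return(vals)       # last column: return the leaves
  lapply(split(rows, vals),              # group the row indices by column i
         function(r) makeTree3(dat, r, i + 1L))
}

df <- data.frame(a = c('A','A','A','B','B','C','C','C'),
                 b = c('Aa','Ab','Ab','Ba','Bd','C1','C2','C3'),
                 c = c('Aa1','Ab1','Ab2','Ba1','Bd2','C11','C12','C13'),
                 stringsAsFactors = TRUE)   # factors on purpose

tree <- makeTree3(df)
str(tree)   # compact view of the nested list
```

The defaults for rows and i mean the top-level call needs only the data frame, and the number of levels is taken from ncol(dat) rather than passed in.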
On Wed, Mar 13, 2013 at 10:25 AM, Not To Miss <not.to.miss at gmail.com> wrote:
> The ideal solution, I think, is probably recursive. At the last minute I
> decided to write a Python script to do this (using Python instead of Perl or
> R because of Python's mutable dict data structure), although I would have
> preferred to keep all my code in one R piece. I am posting the code here in
> case you are interested. It generates a dict of dict of dict ...
>
> Hopefully I will not get beaten up for posting Python code on an R mailing
> list. :-)
>
> import sys
> tree = {}
> ## input file is a table with TAB-delimited columns
> for line in open(sys.argv[1]):
>     if line.startswith('#'): continue
>     items = line.strip().split('\t')
>     tmp = tree
>     for item in items:
>         if item not in tmp:
>             tmp[item] = {}
>         tmp = tmp[item]
>
> The tree looks like this for the example:
> {'A': {'Aa': {'Aa1': {}}, 'Ab': {'Ab1': {}, 'Ab2': {}}}, 'C': {'C3': {'C13':
> {}}, 'C2': {'C12': {}}, 'C1': {'C11': {}}}, 'B': {'Bd': {'Bd2': {}}, 'Ba':
> {'Ba1': {}}}}
>
> On Wed, Mar 13, 2013 at 10:35 AM, David Winsemius <dwinsemius at comcast.net>
> wrote:
>>
>>
>> On Mar 12, 2013, at 9:22 PM, Not To Miss wrote:
>>
>> Nope, Bert, you miss me? :-D
>>
>> I apologize that I didn't provide a more realistic example and describe
>> the problem more clearly. The real data are just too complicated to post in
>> emails, so I made up a simple example, which perhaps seems a little
>> oversimplified now, but the basic structure is the same. Here is a more
>> appropriate one:
>> >data.frame(a=c('A','A', 'A', 'B','B','C','C','C'), b=c('Aa',
>> > 'Ab','Ab','Ba','Bd', 'C1','C2','C3'), c=c('Aa1', 'Ab1', 'Ab2', 'Ba1', 'Bd2',
>> > 'C11','C12','C13'))
>>   a  b   c
>> 1 A Aa Aa1
>> 2 A Ab Ab1
>> 3 A Ab Ab2
>> 4 B Ba Ba1
>> 5 B Bd Bd2
>> 6 C C1 C11
>> 7 C C2 C12
>> 8 C C3 C13
>>
>> The data structure to convert to:
>>       |---Aa------Aa1
>>   A---|        /--Ab1
>>   |   |---Ab--|
>>   |            \--Ab2
>>   |   |---Ba------Ba1
>>   B---|
>>   |   |---Bd------Bd2
>>   |
>>   |   /---C1-----C11
>>   C---|----C2-----C12
>>       \---C3-----C13
>>
>> It's multi-level nested and I won't know how many rows and columns of the
>> data.frame ahead of time. I plan to write a perl script to do the
>> conversion, just more familiar, if it's not easy to do in R. Thanks Don and
>> Greg for suggesting solutions.
>>
>>
>> After a bit of coding I am going to say your proposed answer is wrong (or
>> at least improperly specified). The first level can be recovered as you
>> suggest :
>>
>> > sapply(unique(dfrm[[1]]), function(x) dfrm[[2]][grep(x, dfrm[[2]]) ])
>> $A
>> [1] "Aa" "Ab" "Ab"
>>
>> $B
>> [1] "Ba" "Bd"
>>
>> $C
>> [1] "C1" "C2" "C3"
>>
>>
>> But the second level cannot be as you imagined. The third level items
>> beginning with "C1" all get associated together and there are no terminal
>> nodes for C2 or C3 at the third level.
>>
>> > sapply(unique(dfrm[[2]]), function(x) dfrm[[3]][grep(x, dfrm[[3]]) ])
>> $Aa
>> [1] "Aa1"
>>
>> $Ab
>> [1] "Ab1" "Ab2"
>>
>> $Ba
>> [1] "Ba1"
>>
>> $Bd
>> [1] "Bd2"
>>
>> $C1
>> [1] "C11" "C12" "C13"
>>
>> $C2
>> character(0)
>>
>> $C3
>> character(0)
>>
>> lev1 <- sapply(unique(dfrm[[1]]), function(x) dfrm[[2]][grep(x, dfrm[[2]])])
>> lapply(lev1, function(ll) lapply(ll, function(lll) dfrm[[3]][grep(lll, dfrm[[3]])]))
>>
>> $A
>> $A[[1]]
>> [1] "Aa1"
>>
>> $A[[2]]
>> [1] "Ab1" "Ab2"
>>
>> $A[[3]]
>> [1] "Ab1" "Ab2"
>>
>>
>> $B
>> $B[[1]]
>> [1] "Ba1"
>>
>> $B[[2]]
>> [1] "Bd2"
>>
>>
>> $C
>> $C[[1]]
>> [1] "C11" "C12" "C13"
>>
>> $C[[2]]
>> character(0)
>>
>> $C[[3]]
>> character(0)
>>
>> --
>> David.
>>
>>
>>
>> On Tue, Mar 12, 2013 at 2:18 PM, Bert Gunter <gunter.berton at gene.com>
>> wrote:
>>>
>>> So Mr. "not.tomiss" missed?
>>>
>>> :(
>>>
>>> -- Bert
>>>
>>> On Tue, Mar 12, 2013 at 1:08 PM, David Winsemius <dwinsemius at comcast.net>
>>> wrote:
>>> >
>>> > On Mar 12, 2013, at 9:37 AM, Not To Miss wrote:
>>> >
>>> >> Thanks. Is there any more elegant solution? What if I don't know how
>>> >> many levels of nesting ahead of time?
>>> >
>>> > It's even worse than what you now offer as a potential complication.
>>> > You did not provide an example of a data object that would illustrate the
>>> > complexity of the task nor what you consider the correct procedure (i.e. the
>>> > order of the columns to be used for splitting) nor the correct results. The
>>> > task is woefully underspecified at the moment. It's a bit akin to asking
>>> > "how do I do classification" without saying what you want to classify.
>>> >
>>> > --
>>> > David.
>>> >>
>>> >>
>>> >> On Tue, Mar 12, 2013 at 8:51 AM, Greg Snow <538280 at gmail.com> wrote:
>>> >>
>>> >>> You can use the lapply or rapply functions on the resulting list to
>>> >>> break each piece into a list itself, then apply the lapply or rapply
>>> >>> function to those resulting lists, ...
>>> >>>
>>> >>>
>>> >>> On Mon, Mar 11, 2013 at 3:41 PM, Not To Miss
>>> >>> <not.to.miss at gmail.com>wrote:
>>> >>>
>>> >>>> Thanks. That's just a simple example - what if there are more
>>> >>>> columns and more rows? Is there any easy way to create a nested list?
>>> >>>>
>>> >>>> Best,
>>> >>>> Zech
>>> >>>>
>>> >>>>
>>> >>>> On Mon, Mar 11, 2013 at 2:12 PM, MacQueen, Don <macqueen1 at llnl.gov>
>>> >>>> wrote:
>>> >>>>
>>> >>>>> You will have to decide what R data structure is a "tree structure".
>>> >>>>> But maybe this will get you started:
>>> >>>>>
>>> >>>>> > foo <- data.frame(x=c('A','A','B','B'), y=c('Ab','Ac','Ba','Bd'))
>>> >>>>> > split(foo$y, foo$x)
>>> >>>>> $A
>>> >>>>> [1] "Ab" "Ac"
>>> >>>>>
>>> >>>>> $B
>>> >>>>> [1] "Ba" "Bd"
>>> >>>>>
>>> >>>>> I suppose it is at least a little bit tree-like.
>>> >>>>>
>>> >>>>>
>>> >>>>> --
>>> >>>>> Don MacQueen
>>> >>>>>
>>> >>>>> Lawrence Livermore National Laboratory
>>> >>>>> 7000 East Ave., L-627
>>> >>>>> Livermore, CA 94550
>>> >>>>> 925-423-1062
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>> On 3/10/13 9:19 PM, "Not To Miss" <not.to.miss at gmail.com> wrote:
>>> >>>>>
>>> >>>>>> I have a data.frame object like:
>>> >>>>>>
>>> >>>>>> > data.frame(x=c('A','A','B','B'), y=c('Ab','Ac','Ba','Bd'))
>>> >>>>>>   x  y
>>> >>>>>> 1 A Ab
>>> >>>>>> 2 A Ac
>>> >>>>>> 3 B Ba
>>> >>>>>> 4 B Bd
>>> >>>>>>
>>> >>>>>> How could I create a tree structure object like this:
>>> >>>>>>       |---Ab
>>> >>>>>>   A---|
>>> >>>>>>  _|   |---Ac
>>> >>>>>>   |
>>> >>>>>>   |   |---Ba
>>> >>>>>>   B---|
>>> >>>>>>       |---Bd
>>> >>>>>> Thanks,
>>> >>>>>> Zech
>>> >>>>>>
>>> >>>>>> [[alternative HTML version deleted]]
>>> >>>>>>
>>> >>>>>> ______________________________________________
>>> >>>>>> R-help at r-project.org mailing list
>>> >>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> >>>>>> PLEASE do read the posting guide
>>> >>>>>> http://www.R-project.org/posting-guide.html
>>> >>>>>> and provide commented, minimal, self-contained, reproducible code.
>>> >>>>>
>>> >>>>>
>>> >>>>
>>> >>>>
>>> >>>
>>> >>>
>>> >>>
>>> >>> --
>>> >>> Gregory (Greg) L. Snow Ph.D.
>>> >>> 538280 at gmail.com
>>> >>>
>>> >>
>>> >
>>> > David Winsemius
>>> > Alameda, CA, USA
>>> >
>>>
>>>
>>>
>>> --
>>>
>>> Bert Gunter
>>> Genentech Nonclinical Biostatistics
>>>
>>> Internal Contact Info:
>>> Phone: 467-7374
>>> Website:
>>>
>>> http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm
>>
>>
>>
>> David Winsemius
>> Alameda, CA, USA
>>
>
--
Bert Gunter
Genentech Nonclinical Biostatistics
Internal Contact Info:
Phone: 467-7374
Website:
http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm