[R] fast subsetting of lists in lists

Tue Dec 7 20:59:05 CET 2010

Hi Bert,

thank you for your suggestion. I'm sure it's a good one. But my intention in the first place was to 
learn how to get subsets of lists nested in lists the fast way (and preferably also the easy way, 
but that is only my laziness).

It seems this thread is getting a bit long and is also leading to some confusion (at least on my 
side, as an R beginner). As this is definitely not my aim, I suggest we close it. Thanks, everyone, 
for your kind hints and input; I learned some interesting things.

All the best

Alex

On 07.12.2010 20:38, Bert Gunter wrote:
> Alexander:
>
> I'm not sure exactly what you want, so the following may be irrelevant...
>
> BUT, noting that data frames ARE lists, and IF what you have can then
> be abstracted as lists of lists of lists of ... to various depths,
> AND IF what you want is just to pick out and combine all named
> vectors (which could be columns of data frames) with a given name at
> whatever depth they appear in the lists,
> THEN it is natural to do this recursively as follows:
>
> vaccuum<- function(x,nm)
> ## x is a named list (of lists of ...)
> ## nm is the name searched for
> {
>    y<- NULL
>    for(nmx in names(x)) y<-  c(y,
>    {
> 	z<- x[[nmx]]
> 	if(nmx==nm)z
> 	else if(is.list(z))Recall(z,nm)
> 	})
>   y
> }
>
> ##Example
>
>> test<- list(a=1:3, be = list(a=4:6,c=data.frame(a=10:15,b=I(letters[1:6]))))
>
>> vaccuum(test,"b")
> [1] "a" "b" "c" "d" "e" "f"
>
>> vaccuum(test,"a")
>   [1]  1  2  3  4  5  6 10 11 12 13 14 15
>
> caveat: If perchance this is (at least close to) what you want, it is
> likely to be rather inefficient. It also needs some robustifying.
>
> Cheers,
> Bert
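
One possible way to robustify the function above, as a sketch only. The name `vacuum_robust` is made up for illustration. It assumes that unnamed or partly named lists should still be searched, and it returns the matches as a list, so the caller decides how to combine them.

```r
# Sketch: recursively collect every element called `nm`, at any depth.
# Unlike the loop over names(x), this also descends into unnamed lists.
vacuum_robust <- function(x, nm) {
  out <- list()
  nms <- names(x)
  if (is.null(nms)) nms <- rep("", length(x))  # unnamed list: no element can match
  for (i in seq_along(x)) {
    z <- x[[i]]
    if (identical(nms[i], nm)) {
      out <- c(out, list(z))                   # keep the match as one list element
    } else if (is.list(z)) {
      out <- c(out, vacuum_robust(z, nm))      # data frames are lists, so columns are searched too
    }
  }
  out
}

test <- list(a = 1:3,
             be = list(a = 4:6,
                       c = data.frame(a = 10:15, b = I(letters[1:6]))))
res_a <- unlist(vacuum_robust(test, "a"))

# The unnamed outer list from the original question is handled as well.
test0 <- list(list(a = 1, b = 2, c = 3),
              list(a = 4, b = 5, c = 6),
              list(a = 7, b = 8, c = 9))
res_0 <- unlist(vacuum_robust(test0, "a"))
```

Collecting into a list and calling `unlist()` once at the end avoids coercing mixed types too early. It makes no claim about speed.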
>
> On Tue, Dec 7, 2010 at 10:54 AM, Alexander Senger
> <senger at physik.hu-berlin.de>  wrote:
>> Hello,
>>
>> Matthew's hint is interesting:
>>
>> On 07.12.2010 19:16, Matthew Dowle wrote:
>>> Hello Alex,
>>>
>>> Assuming it was just an inadequate example (since a data.frame would suffice
>>> in that case), did you know that a data.frame's columns do not have to be
>>> vectors but can be lists?  I don't know if that helps.
>>>
>>>> DF = data.frame(a=1:3)
>>>> DF$b = list(pi, 2:3, letters[1:5])
>>>> DF
>>>    a             b
>>> 1 1      3.141593
>>> 2 2          2, 3
>>> 3 3 a, b, c, d, e
>>>> DF$b
>>> [[1]]
>>> [1] 3.141593
>>>
>>> [[2]]
>>> [1] 2 3
>>>
>>> [[3]]
>>> [1] "a" "b" "c" "d" "e"
>>>> sapply(DF,class)
>>>          a         b
>>> "integer"    "list"
>>>>
>>>
>>> That is still regular though in the sense that each row has a value for all
>>> the columns, even if that value is NA, or NULL in lists.
>>
>> My data is mostly regular; that is, every sublist contains a data.frame,
>> which is the major contribution to overall size. The reason I use lists
>> is mainly that I also need some bits of information about the
>> environment. I thought about putting these into additional columns of
>> the data.frame, one column per variable (adding redundancy and maybe
>> 30% of overhead this way). But as memory usage is already close to the
>> limit of my machine, this might break things (the situation is a bit
>> tricky, isn't it?).
>> I didn't know that a column of a data.frame can be a list. So if I need
>> only, let's say, 10 entries in that list, but my data.frame has several
>> hundred rows, would the "empty" parts of the "column-list" be filled
>> with recycled values, or would they be really empty and thus not use
>> additional memory?
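
On the list-column question: as far as I know, a list column needs exactly one element per row, so unused rows still occupy a slot. A minimal sketch, using the made-up names `DF`, `info` and `env`, that fills the unused slots with NULL:

```r
# Sketch: a list column with entries in only a few rows.
DF <- data.frame(a = 1:6)

info <- vector("list", nrow(DF))  # one slot per row, all NULL to start with
info[[1]] <- c(temp = 21.5)       # hypothetical "environment" values
info[[2]] <- "sensor A"

DF$env <- info                    # list length equals nrow(DF), so no recycling occurs

empty_rows <- sapply(DF$env, is.null)
```

A NULL slot should cost roughly one pointer per row rather than a copy of any data. I believe a shorter list assigned with `$<-` is recycled when its length divides the number of rows and raises an error otherwise, so the column never stays "really empty".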
>> Secondly, as I mentioned in another email on this topic: a whole day of
>> data contains about 100 chunks of data, that is, 100 of the sublists
>> described above. I could put them all into one large data.frame, but
>> then I would have to extract the "environmental data" from the long
>> list, now containing repeated occurrences of variables with the same
>> name. I guess subsetting could become tricky here (dependent on name and
>> position, I assume), but I'm eager to learn an easy way of doing so.
>>
>> Sorry for not submitting an illustrative example, but I'm afraid that
>> would be quite lengthy and not so illustrative any more.
>>
>> The data.table mentioned below seems to be an interesting alternative;
>> I'll definitely look into this. But it would also mean quite a bit of
>> homework, as far as I can see...
>>
>> Thanks
>>
>> Alex
>>
>>
>>> If your data is not regular then one option is to flatten it into
>>> (row,column,value) tuples, similar to how sparse matrices are stored.  Your
>>> value column may be a list rather than a vector.
>>>
>>> Then (and yes you guessed this was coming) ... you can use data.table to
>>> query the flat structure quickly by setting a key on the first two columns,
>>> or maybe just the 2nd column when you need to pick out the values for one
>>> 'column' quickly for all 'rows'.
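
The (row, column, value) layout described above can be sketched in base R alone, here applied to the small `test` list from the original question. The names `flat`, `row`, `column` and `value` are illustrative only. With data.table, a keyed lookup would take the place of the logical comparison in the last step.

```r
# Sketch: flatten a list of lists into (row, column, value) tuples.
test <- list(list(a = 1, b = 2, c = 3),
             list(a = 4, b = 5, c = 6),
             list(a = 7, b = 8, c = 9))

flat <- data.frame(
  row    = rep(seq_along(test), times = sapply(test, length)),  # index of the sublist
  column = unlist(lapply(test, names)),                         # element name inside the sublist
  stringsAsFactors = FALSE
)

# The value column is a list, so entries may differ in type and length.
flat$value <- unlist(test, recursive = FALSE, use.names = FALSE)

# Pick out the values of one 'column' for all 'rows'.
b_values <- unlist(flat$value[flat$column == "b"])
```

The comparison `flat$column == "b"` scans the whole column on every query. That scan is the part a key on `column` is meant to avoid.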
>>>
>>> There was a thread about using list() columns in data.table here :
>>>
>>> http://r.789695.n4.nabble.com/Suggest-a-cool-feature-Use-data-table-like-a-sorted-indexed-data-list-tp2544213p2544213.html
>>>
>>>> Does someone know a trick to do the same as above with the faster built-in
>>>> subsetting? Something like:
>>>> test[<somesubsettingmagic>]
>>>
>>> So in data.table if you wanted all the 'b' values,  you might do something
>>> like this :
>>>
>>> setkey(DT,column)
>>> DT[J("b"), value]
>>>
>>> which should return the list() quickly from the irregular data.
>>>
>>> Matthew
>>>
>>>
>>> "Alexander Senger"<senger at physik.hu-berlin.de>  wrote in message
>>> news:4CFE6AEE.6030204 at physik.hu-berlin.de...
>>>> Hello Gerrit, Gabor,
>>>>
>>>>
>>>> thank you for your suggestion.
>>>>
>>>> Unfortunately unlist seems to be rather expensive. A short test with one
>>>> of my datasets gives 0.01s for an extraction based on my approach and
>>>> 5.6s for unlist alone. The reason seems to be that unlist relies on
>>>> lapply internally and does so recursively?
>>>>
>>>> Maybe there is still another way to go?
>>>>
>>>> Alex
>>>>
>>>> On 07.12.2010 15:59, Gerrit Eichner wrote:
>>>>> Hello, Alexander,
>>>>>
>>>>> does
>>>>>
>>>>> utest<- unlist(test)
>>>>> utest[ names( utest) == "a"]
>>>>>
>>>>> come close to what you need?
>>>>>
>>>>> Hth,
>>>>>
>>>>> Gerrit
>>>>>
>>>>>
>>>>> On Tue, 7 Dec 2010, Alexander Senger wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>>
>>>>>> my data is contained in nested lists (which seems not necessarily to be
>>>>>> the best approach). What I need is a fast way to get subsets from the
>>>>>> data.
>>>>>>
>>>>>> An example:
>>>>>>
>>>>>> test<- list(list(a = 1, b = 2, c = 3), list(a = 4, b = 5, c = 6),
>>>>>> list(a = 7, b = 8, c = 9))
>>>>>>
>>>>>> Now I would like to have all values in the named variables "a", that is
>>>>>> the vector c(1, 4, 7). The best I could come up with is:
>>>>>>
>>>>>> val<- sapply(1:3, function (i) {test[[i]]$a})
>>>>>>
>>>>>> which is unfortunately not very fast. According to R-inferno this is due
>>>>>> to the fact that apply and its derivatives do their looping in R rather
>>>>>> than relying on C subroutines, as the common [-operator does.
>>>>>>
>>>>>> Does someone know a trick to do the same as above with the faster
>>>>>> built-in subsetting? Something like:
>>>>>>
>>>>>> test[<somesubsettingmagic>]
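
For reference, a few base R variants of the extraction asked about here, as a sketch. They pass the `[[` operator directly instead of wrapping it in an index loop. Whether any of them is faster on real data is an assumption that would need timing.

```r
test <- list(list(a = 1, b = 2, c = 3),
             list(a = 4, b = 5, c = 6),
             list(a = 7, b = 8, c = 9))

# Iterate over the sublists themselves and pass `[[` as the function.
val1 <- sapply(test, `[[`, "a")

# vapply states the expected result type, so a malformed sublist raises an error.
val2 <- vapply(test, function(el) el[["a"]], numeric(1))

# lapply followed by a single unlist of the already extracted values.
val3 <- unlist(lapply(test, `[[`, "a"), use.names = FALSE)
```

All three give the same vector as the `sapply(1:3, ...)` version for this example. The `vapply` form is the strictest about what each sublist must contain.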
>>>>>>
>>>>>>
>>>>>> Thank you for your advice
>>>>>>
>>>>>>
>>>>>> Alex
>>>>>>
>>>>>> ______________________________________________
>>>>>> R-help at r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>> PLEASE do read the posting guide
>>>>>> http://www.R-project.org/posting-guide.html
>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>>