[R] How to extract same columns from identical dataframes in a list?

Wed Feb 10 15:48:49 CET 2016

On 10 Feb 2016, at 10:04 , Wolfgang Waser <waser at frankenfoerder-fg.de> wrote:

> Hi,
> 
> sapply(l,"[",T,2)
> 
> and
> 
> sapply(l, function(e) e[, 2])
> 
> 
> work fine!
> 
> 
> Thanks a lot!
> 
> Why is the second version "brute force and ignorance"? Is one of the
> versions to be preferred? If so, which and why (very briefly, please)?

It's slightly less elegant and it requires you to set up an extra function, rather than just using the indexing operator as a function. 

On the other hand, it is the obvious generic approach: Write a function to do what you want for one element, then apply the function to each element with sapply(). The extra overhead is likely irrelevant. It is also more readable since you don't need to mentally keep track of things like the fact that "["(x,T,2) is the same as x[T,2].
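
A minimal sketch of the two variants side by side, assuming a made-up list l of three small matrices (the data are illustrative only, not yours). TRUE is spelled out instead of T here, since T is an ordinary variable that can be reassigned:

```r
# Hypothetical data: a list of three 4 x 2 matrices
set.seed(1)
l <- replicate(3, cbind(w1 = sample(1:4), w2 = sample(1:4)), simplify = FALSE)

# Variant 1: use the indexing operator itself as the function;
# the extra arguments TRUE and 2 are passed on, so each call is e[TRUE, 2]
a <- sapply(l, "[", TRUE, 2)

# Variant 2: write a function for one element, then apply it to each element
b <- sapply(l, function(e) e[, 2])

# Both give the same 4 x 3 matrix (one column per list element)
identical(a, b)

# Row-wise calculation on the combined columns
rowMeans(a)
```

Either result can go straight into apply(x, 1, mean) or rowMeans().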

> 
> 
> Results of the other options mentioned:
> 
>> sapply(l,"[[",2)
> 
> results in a single vector of length 7
> 
> 
>> sapply(l,"[",,2)
> Error in lapply(X = X, FUN = FUN, ...) :
> argument is missing, with no default
> 
> These versions probably don't work due to the "data frames" in the list
> actually being matrices.

Exactly.
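
For illustration, a small sketch of how the same index behaves on a data frame and on a matrix (df and m are hypothetical objects, not the ones from your list):

```r
# Hypothetical objects holding the same values
df <- data.frame(w1 = 1:4, w2 = 5:8)
m  <- as.matrix(df)

col_df  <- df[[2]]   # data frame: "[[" with 2 picks the second column
elem_m  <- m[[2]]    # matrix: "[[" with 2 picks the second element only
first_m <- m[1:4]    # matrix: a single index runs down the columns,
                     # so 1:4 is the first column of a 4-row matrix
cols_df <- df[1:2]   # data frame: a single index selects columns

col_df
elem_m
first_m
is.data.frame(cols_df)
```

A quick sapply(l, is.data.frame) on the list should tell you which case you are in.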

> 
> 
> I'm not enough of a programmer to always make complete sense of the R
> help pages. Should I have found this information in the sapply - R help
> page?

Not really. Well, the brute-force-and-ignorance approach should be apparent from that page, but the sapply(l, "[",.....) stuff requires that you first understand that operators are really function calls, and what arguments they take. This is part of a general understanding that won't fit in any individual help page.
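
A few lines that may help with the "operators are function calls" idea (a sketch with made-up vectors; the backtick form is just another way of writing the same call):

```r
# Hypothetical named vector
x <- c(a = 10, b = 20, c = 30)

# Indexing written the usual way and as an explicit function call
usual    <- x[2]
explicit <- `[`(x, 2)
identical(usual, explicit)

# Arithmetic operators are functions too
`+`(1, 2)

# Hence sapply() can be handed "[" as FUN, with the index as an extra argument
second <- sapply(list(1:3, 4:6), "[", 2)
second
```

The arguments that "[" and "[[" accept are described on the help page you get from help("[").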

> Where else could I check before pestering the R mailing list, which, of
> course, provides quick and valuable answers.

For that, you may need someone who was introduced to R more recently than I was. There are multiple books on R programming, and the free manuals from CRAN could also be useful.

-pd

> 
> 
> Cheers,
> 
> Wolfgang
> 
> 
> 
> 
> On 09/02/16 16:19, peter dalgaard wrote:
>> Like this?
>> 
>>> l <- replicate(3,data.frame(w1=sample(1:4),w2=sample(1:4)), simplify=FALSE)
>>> l
>> [[1]]
>>   w1 w2
>> 1  2  2
>> 2  3  3
>> 3  1  1
>> 4  4  4
>> 
>> [[2]]
>>   w1 w2
>> 1  3  4
>> 2  2  2
>> 3  1  3
>> 4  4  1
>> 
>> [[3]]
>>   w1 w2
>> 1  1  4
>> 2  4  3
>> 3  2  1
>> 4  3  2
>> 
>>> sapply(l,"[[",2)
>>      [,1] [,2] [,3]
>> [1,]    2    4    4
>> [2,]    3    2    3
>> [3,]    1    3    1
>> [4,]    4    1    2
>> 
>> Or even
>> 
>>> sapply(l,"[",,2)
>>      [,1] [,2] [,3]
>> [1,]    2    4    4
>> [2,]    3    2    3
>> [3,]    1    3    1
>> [4,]    4    1    2
>> 
>> 
>> Notice that if dd[1:24] gives you the 1st column, then dd is not a data frame but rather a matrix, and indexing semantics are different. In that case, for some unspeakable reason, the empty index does not work and you'll need something like
>> 
>>> l <- replicate(3,cbind(w1=sample(1:4),w2=sample(1:4)), simplify=FALSE)
>>> sapply(l,"[",T,2)
>>      [,1] [,2] [,3]
>> [1,]    4    3    2
>> [2,]    1    1    4
>> [3,]    3    2    3
>> [4,]    2    4    1
>> 
>> Or, brute-force-and-ignorance:
>> 
>>> sapply(l, function(e) e[, 2])
>>      [,1] [,2] [,3]
>> [1,]    4    3    2
>> [2,]    1    1    4
>> [3,]    3    2    3
>> [4,]    2    4    1
>> 
>> 
>> 
>> 
>> 
>> On 09 Feb 2016, at 10:03 , Wolfgang Waser <waser at frankenfoerder-fg.de> wrote:
>> 
>>> Hi,
>>> 
>>> sorry if my description was too short / unclear.
>>> 
>>>> I have a list of 7 data frames, each data frame having 24 rows (hour of
>>>> the day) and 5 columns (weeks) with a total of 5 x 24 values
>>> 
>>> [1]
>>> 	week1	week2	week3	...
>>> 1	x	a	m	...
>>> 2	y	b	n
>>> 3	z	c	o
>>> .	.	.	.
>>> .	.	.	.
>>> .	.	.	.
>>> 24	.	.	.
>>> 
>>> 
>>> [2]
>>> 	week1	week2	week3	...
>>> 1	x2	a2	m2	...
>>> 2	y2	b2	n2
>>> 3	z2	c2	o2
>>> .	.	.	.
>>> .	.	.	.
>>> .	.	.	.
>>> 24	.	.	.
>>> 
>>> 
>>> [3]
>>> ...
>>> 
>>> .
>>> .
>>> .
>>> 
>>> 
>>> [7]
>>> ...
>>> 
>>> 
>>> 
>>> I now would like to extract e.g. all week2 columns of all data frames in
>>> the list and combine them in a new data frame using cbind.
>>> 
>>> new data frame
>>> 
>>> week2 ([1])	week2 ([2])	week2 ([3])	...
>>> a		a2		.
>>> b		b2		.
>>> c		c2		.
>>> .
>>> .
>>> .
>>> 
>>> I will then do further row-wise calculations using e.g. apply(x,1,mean),
>>> the result being a vector of 24 values.
>>> 
>>> 
>>> I have not found a way to extract specific columns of the data frames in
>>> a list.
>>> 
>>> 
>>> As mentioned I can use
>>> 
>>> sapply(list_of_dataframes,"[",1:24)
>>> 
>>> which will pick the first 24 values (first column) of each data frame in
>>> the list and arrange them as an array of 24 rows and 7 columns (7 data
>>> frames are in the list).
>>> To pick the second column (week2) using sapply I have to use the next 24
>>> values from 25 to 48:
>>> 
>>> sapply(list_of_dataframes,"[",25:48)
>>> 
>>> 
>>> It seems that sapply treats the data frames in the list as vectors. I
>>> can of course extract all consecutive weeks using consecutive blocks of
>>> 24 values, but this seems cumbersome.
>>> 
>>> 
>>> The question remains, how to select specific columns from data frames in
>>> a list, e.g. all columns 3 of all data frames in the list.
>>> 
>>> 
>>> Reformatting (unlist(), dim()) in one data frame with one column for
>>> each week does not help, since I'm not calculating colMeans etc, but
>>> row-wise calculations using apply(x,1,FUN) ("applying a function to
>>> margins of an array or matrix").
>>> 
>>> 
>>> 
>>> Thanks for your help and suggestions!
>>> 
>>> 
>>> Wolfgang
>>> 
>>> 
>>> 
>>> On 08/02/16 18:00, Dénes Tóth wrote:
>>>> Hi,
>>>> 
>>>> Although you did not provide any reproducible example, it seems you
>>>> store the same type of values in your data.frames. If this is true, it
>>>> is much more efficient to store your data in an array:
>>>> 
>>>> mylist <- list(a = data.frame(week1 = rnorm(24), week2 = rnorm(24)),
>>>>              b = data.frame(week1 = rnorm(24), week2 = rnorm(24)))
>>>> 
>>>> myarray <- unlist(mylist, use.names = FALSE)
>>>> dim(myarray) <- c(nrow(mylist$a), ncol(mylist$a), length(mylist))
>>>> dimnames(myarray) <- list(hour = rownames(mylist$a),
>>>>                         week = colnames(mylist$a),
>>>>                         other = names(mylist))
>>>> # now you can do:
>>>> mean(myarray[, "week1", "a"])
>>>> 
>>>> # or:
>>>> colMeans(myarray)
>>>> 
>>>> 
>>>> Cheers,
>>>> Denes
>>>> 
>>>> 
>>>> On 02/08/2016 02:33 PM, Wolfgang Waser wrote:
>>>>> Hello,
>>>>> 
>>>>> I have a list of 7 data frames, each data frame having 24 rows (hour of
>>>>> the day) and 5 columns (weeks) with a total of 5 x 24 values
>>>>> 
>>>>> I would like to combine all 7 columns of week 1 (and 2 ...) in a
>>>>> separate data frame for hourly calculations, e.g.
>>>>>> apply(new.data.frame,1,mean)
>>>>> 
>>>>> In some way sapply (lapply) works, but I cannot directly select columns
>>>>> of the original data frames in the list. As a workaround I have to
>>>>> select a range of values:
>>>>> 
>>>>>> sapply(list_of_dataframes,"[",1:24)
>>>>> 
>>>>> Values 1:24 give the first column, 25:48 the second and so on.
>>>>> 
>>>>> Is there an easier / more direct way to select for specific columns
>>>>> instead of selecting a range of values, avoiding loops?
>>>>> 
>>>>> 
>>>>> Cheers,
>>>>> 
>>>>> Wolfgang
>>>>> 

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com