[R] flexible approach to subsetting data

David Carlson dcarlson at tamu.edu
Tue Jul 23 23:00:40 CEST 2013


Actually the ".0" on the first variable is not needed.

You could modify the reshape() call to search for the base
name of each variable so you would not need to change the code
if the number of replications changes:

reshape(df5, direction="long", v.names=c("dose", "resp"),
        varying=list(dose=grepl("dose", names(df5)),
                     resp=grepl("resp", names(df5))))
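
A minimal sketch of the same idea, assuming a hypothetical three-replication
version of df5 whose first pair of columns has no ".0" suffix. The names
`base` and `varying` are only illustrative, and the anchored pattern is my own
precaution so that a base name such as "X1" would not also pick up "X10":

```r
# Hypothetical data: first replication unsuffixed, later ones .1 and .2
df5 <- data.frame(dose   = c(40, 50, 60, 50), resp   = c(40, 50, 60, 50),
                  dose.1 = c(1, 2, 1, 2),     resp.1 = c(1, 2, 1, 2) + 3,
                  dose.2 = c(2, 1, 2, 1),     resp.2 = c(1, 2, 1, 2) + 3)

base <- c("dose", "resp")

# One character vector of column names per base variable; the pattern
# matches "dose" and "dose.<digits>" but nothing longer.
varying <- lapply(base, function(b)
  grep(paste0("^", b, "(\\.[0-9]+)?$"), names(df5), value = TRUE))

long <- reshape(df5, direction = "long", v.names = base, varying = varying)
```

Because both v.names and varying are supplied, reshape() does not try to
parse the column names, so the missing suffix does no harm; the time column
is simply numbered 1, 2, 3.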

-------------------------------------
David L Carlson
Associate Professor of Anthropology
Texas A&M University
College Station, TX 77840-4352

-----Original Message-----
From: r-help-bounces at r-project.org
[mailto:r-help-bounces at r-project.org] On Behalf Of David
Winsemius
Sent: Tuesday, July 23, 2013 1:12 PM
To: David Winsemius
Cc: R help; Andrea Lamont
Subject: Re: [R] flexible approach to subsetting data


On Jul 23, 2013, at 10:49 AM, David Winsemius wrote:

> 
> On Jul 23, 2013, at 10:01 AM, Adams, Jean wrote:
> 
>> Check out the reshape() function of the reshape package. Here's one of the
>> examples from ?reshape.
>> 
>> Jean
>> 
>> 
>> library(reshape)   # No, at least not for the reshape-function
> 
> The reshape function is from the 'stats' package, which ships with base R.
> The 'reshape' and 'reshape2' packages were written (at least in part)
> because the 'reshape'-function was so difficult to understand.
> 
> If you do choose to use the reshape2 package, which is well-respected and
> often extremely helpful, the function you will want to start with is 'melt'.
> 
> 
>> long <- reshape(wide, direction="long")
> 
> I don't think this example will be particularly helpful since the initial
> direction is "long" (from "wide") and more input would be needed.

Here's a dataset to experiment with

df5 <- data.frame(dose.0 = c(40,50,60,50), resp.0 = c(40,50,60,50),
                  dose.1 = c(1,2,1,2),     resp.1 = c(1,2,1,2)+3,
                  dose.2 = c(2,1,2,1),     resp.2 = c(1,2,1,2)+3,
                  dose.3 = c(3,3,3,3),     resp.3 = c(1,2,1,2)+3 )

Notice that you would need to add the ".0" to the column names

reshape(df5,  direction="long", 
              v.names=c("dose", "resp"), 
               varying=list(dose=c(1,3,5,7), resp=c(2,4,6,8) )
        )  # succeeds
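
For comparison, a sketch of letting reshape() split the names itself. This
assumes every column carries a numeric suffix, as in df5 above; passing all
the names as an atomic `varying` vector with sep="." is one way to do it, not
the only one:

```r
df5 <- data.frame(dose.0 = c(40, 50, 60, 50), resp.0 = c(40, 50, 60, 50),
                  dose.1 = c(1, 2, 1, 2),     resp.1 = c(1, 2, 1, 2) + 3,
                  dose.2 = c(2, 1, 2, 1),     resp.2 = c(1, 2, 1, 2) + 3,
                  dose.3 = c(3, 3, 3, 3),     resp.3 = c(1, 2, 1, 2) + 3)

# No v.names given: reshape() splits each name at "." into a variable
# name ("dose", "resp") and a time value (0, 1, 2, 3).
long <- reshape(df5, direction = "long", varying = names(df5), sep = ".")
```

Here the time column takes its values from the suffixes, which is why the
".0" matters for this form of the call.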



So perhaps you could use a similar call (after appending the ".0"s) with:

  varying=list(sim=seq(1,810,by=4),
               X1= seq(2,810,by=4),
               X2= seq(3,810,by=4),
               X3= seq(4,810,by=4)
               )
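
A sketch of building that list from the column count instead of typing 810 by
hand. It assumes the columns come in consecutive blocks of k variables, always
in the same order; `wide`, `base`, `k` and `m` below are made-up stand-ins,
not the real data:

```r
# Hypothetical stand-in: 4 variables per imputation, 3 imputations,
# columns named sim, X1, X2, X3, sim.1, X1.1, ..., X3.2
base <- c("sim", "X1", "X2", "X3")
m    <- 3
nms  <- c(base,
          paste(rep(base, times = m - 1),
                rep(seq_len(m - 1), each = length(base)), sep = "."))
wide <- as.data.frame(matrix(seq_len(2 * length(nms)), nrow = 2,
                             dimnames = list(NULL, nms)))

# Column positions of the j-th variable in every block: j, j + k, j + 2k, ...
k       <- length(base)
varying <- lapply(seq_len(k), function(j) seq(j, ncol(wide), by = k))

long <- reshape(wide, direction = "long", v.names = base,
                varying = varying, timevar = "m")
```

Only `base` has to be edited if the variables change, and nothing has to be
edited when the number of imputations changes.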
               
> 
> 
>> wide
>> long
>> 
>> 
>> 
>> On Tue, Jul 23, 2013 at 9:35 AM, Andrea Lamont <alamont082 at gmail.com> wrote:
>> 
>>> Hello:
>>> 
>>> I am running a simulation study and am stuck with a subsetting problem.
>>> 
>>> Here is the basic issue:
>>> I generated data and am running a simulation that uses multiple imputation.
>>> For each generated dataset, I used multiple imputation. The resultant
>>> dataset is in wide form where each imputation is recorded as a separate
>>> column (though the different simulations are stacked). Here is an example
>>> of what it looks like:
>>> 
>>> sim   X1   X2   X3   sim.1   X1.1   X2.1   X3.1
>>> 1     #    #    #    #       #      #      #
>>> 1     #    #    #    #       #      #      #
>>> 1     #    #    #    #       #      #      #
>>> 2     #    #    #    #       #      #      #
>>> 2     #    #    #    #       #      #      #
>>> 2     #    #    #    #       #      #      #
>>> 
>>> sim refers to the simulated/generated dataset. X1-X3 are the values for the
>>> first imputed dataset, X1.1-X3.1 are the values for the second imputed
>>> dataset.
>>> 
>>> The problem is that I want the data to be in long format, like this:
>>> 
>>> sim m X1 X2 X3
>>> 1  1   #   #    #
>>> 1  2   #   #    #
>>> 2  1   #   #    #
>>> 2  2   #   #    #
>>> 
>>> where m is the imputation number.
>>> This will allow me to do cleaner calculations (e.g. X3-X1).
>>> 
>>> I know I can subset the data manually - e.g. [,1:10] and save this to
>>> separate datasets then rbind; however, I'm looking for a more flexible
>>> approach to do this. This manual approach would be quite tedious as the
>>> number of imputations (and therefore number of columns) increases (with
>>> only 10 imputations, there are roughly 810 columns). Also, I would like to
>>> avoid having to recode each time I change the number of imputations.
>>> 
>>> The same is true for the reshape function, which would require naming
>>> a huge number of columns and edits each time 'm' changes.
> 
> If the columns are named regularly, then 'reshape' will attempt to split
> properly without an explicit naming. Details and a better description of the
> problem might allow more specific answers to emerge. The fact that the first
> instances have no numeric indicators may be a problem for the algorithm.

> 
> Why not post dput(head( dfrm[ ,1:12]))
> 
> -- 
> David.
> 
>>> 
>>> 
>>> Is there a flexible way to approach this? I'm inclined to use a for loop,
>>> but I know that 1) this is generally inefficient and 2) I am having trouble
>>> with the coding regardless.
>>> 
>>> Any suggestions are appreciated.
>>> 
>>> Thanks,
>>> Andrea
>>> 


David Winsemius
Alameda, CA, USA



