[R] Accessing data in groups created with split() and other beginner questions

Clay Heaton ccheaton at gmail.com
Mon Mar 22 14:27:55 CET 2010


Hi, very new to R here...

I have a data frame called 'set' with 100k+ rows in it that looks like this:

  subject           timestamp  yvalue traceabs subjtrace
1       1 1992-07-12 06:05:00      12        1       1-1
2       1 1992-07-12 06:10:00      15        1       1-1
3       1 1992-07-12 06:15:00      17        1       1-1
4       1 1992-07-12 06:20:00      20        1       1-1
5       1 1992-07-12 06:25:00      24        1       1-1
....

There are 89 subjects, each of which has a different number of traces -- it's time series data. There are, in total, around 180 traces. The "subjtrace" variable is just a concatenation of the subject number, a hyphen, and the relative trace number. For instance, the first trace for subject 46 is "46-1" but the traceabs value for the same trace is 71.

I need to perform simple statistics on each subject and on each trace. I also need to graph each trace.

It seems like the easy approach to grouping the data would be to use the split() function:

> temp <- split(set, set$subject)

When I then try, for example:

> summary(temp[1])

all I get as a result is:
  Length Class      Mode
1 5      data.frame list

So I went with:

> lapply(temp[1], summary)

That works, but I'm unable to do something like:

> lapply(temp[1]$yvalue, mean)

because the result returned is:
list()
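
For reference, a minimal sketch of what seems to be going on, assuming the usual list indexing rules apply to the output of split(). Single brackets return a one-element list, so temp[1]$yvalue is NULL and lapply() has nothing to iterate over, while double brackets return the data frame itself. The toy data frame below is made up for illustration and only stands in for the real 'set':

```r
# Toy stand-in for 'set' (values are invented for illustration only)
set <- data.frame(
  subject = c(1, 1, 1, 2, 2),
  yvalue  = c(12, 15, 17, 20, 24)
)
temp <- split(set, set$subject)

class(temp[1])     # "list": single brackets keep the list wrapper
class(temp[[1]])   # "data.frame": double brackets extract the element

# Statistics for one group: index with [[ ]] first, then take the column
first_mean <- mean(temp[[1]]$yvalue)

# The same statistic for every group: apply a function to each data frame
subject_means <- sapply(temp, function(d) mean(d$yvalue))
subject_sds   <- sapply(temp, function(d) sd(d$yvalue))
print(subject_means)
print(subject_sds)
```

The results are named by subject number, so subject_means["2"] would look up a single subject.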

Ultimately, I'm trying to run the exact same code on each group, as defined by the subject number, and each trace. I would like to display something like the following:

Subject # and Summary Statistics
-- Graph of a trace belonging to the subject
-- Summary statistics for the trace
-- Graph of the next trace belonging to the subject
-- Summary statistics for the trace
-- etc...

My intention is to dump this all into a .pdf file with Sweave and LaTeX.

Questions:
- Is split() the best function to use to create the proper groups? Or should I create a separate variable for each group using subset(), like:
temp.46 <- subset(set, subject == 46, select = c(subject, timestamp, yvalue, subjtrace))

- How do I call functions on data within the groups created by split()? Like...
lapply(temp[1]$yvalue, sd)

- To learn the proper way to approach this: what would be the best practice for iterating through the data and writing the results to a .pdf?
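
To make the last question concrete, here is a rough sketch of the kind of loop I have in mind, not a claim that it is the best practice. It uses split() twice (by subject, then by subjtrace within each subject) and the base pdf() device rather than Sweave. The toy data and the output path out_file are hypothetical:

```r
# Toy stand-in for 'set' (values are invented for illustration only)
set <- data.frame(
  subject   = c(1, 1, 1, 1, 2, 2),
  timestamp = as.POSIXct("1992-07-12 06:05:00", tz = "UTC") + 300 * (0:5),
  yvalue    = c(12, 15, 17, 20, 24, 21),
  subjtrace = c("1-1", "1-1", "1-2", "1-2", "2-1", "2-1"),
  stringsAsFactors = FALSE
)

out_file <- file.path(tempdir(), "traces.pdf")  # hypothetical output path
pdf(out_file)                                   # one page per plot

for (subj in split(set, set$subject)) {
  cat("Subject", subj$subject[1], "\n")
  print(summary(subj$yvalue))                   # per-subject statistics

  # drop = TRUE skips unused levels if subjtrace happens to be a factor
  for (trace in split(subj, subj$subjtrace, drop = TRUE)) {
    plot(trace$timestamp, trace$yvalue, type = "l",
         main = paste("Trace", trace$subjtrace[1]),
         xlab = "timestamp", ylab = "yvalue")
    print(summary(trace$yvalue))                # per-trace statistics
  }
}

dev.off()                                       # close the PDF device
```

In this sketch only the plots end up in the PDF; the summaries are printed to the console, so getting both onto the same page would still need Sweave or something similar.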

Thanks! 

