[R] OK - I got the data - now what? :-)

Mark Knecht markknecht at gmail.com
Wed Jul 8 20:13:16 CEST 2009


On Wed, Jul 8, 2009 at 10:51 AM, Michael A. Miller <mmiller3 at iupui.edu> wrote:
>>>>>> Mark wrote:
>
>    > Currently my data is one experiment per row, but that's
>    > wasting space as most experiments only take 20% of the row
>    > and 80% of the row is filled with 0's. I might want to make
>    > the array narrower and have a flag somewhere in the 1st
>    > 10 columns that says that this row is a continuation row
>    > from the previous row. That way I could pack the array
>    > better, use less memory, and when I do finally test for 0
>    > I'd have a shorter line to traverse.
>
> This may be a bit off track from the data manipulation you are
> working on, but I thought I'd point out that another way to
> handle this sort of data is to make a table with one measurement
> per row, rather than one experiment per row.
>
> experiment measurement value
>         A           1  0.27
>         A           2  0.66
>         A           3  0.24
>         A           4  0.55
>         B           1  0.13
>         B           2  0.65
>         B           3  0.83
>         B           4  0.41
>         B           5  0.92
>         B           6  0.67
>         C           1  0.75
>         C           2  0.97
>         C           3  0.49
>         C           4  0.58
>         D           1  1.00
>         D           2  0.71
>         E           1  0.11
>         E           2  0.50
>         E           3  0.98
>         E           4  0.07
>         E           5  0.94
>         E           6  0.57
>         E           7  0.34
>         E           8  0.21
>
>
> If you write the output of your calculations this way, one
> value per line, it can be read straight into R as a data.frame
> and handled with much less munging.  No need to remove the
> zero-padding because the zeros aren't needed in the first place.
>
> You can subset the data with subset, as in
>
>  test <- read.table('test.dat',header=TRUE)
>  expA <- subset(test, experiment=='A')
>  expB <- subset(test, experiment=='B')
>
> so there is no need to deal with ragged/zero-padded arrays. Your
> plots can be grouped automatically with lattice:
>
> require(lattice)
> xyplot(value ~ measurement, data=test, group=experiment, type='b')
> xyplot(value ~ measurement | experiment, data=test, type='b')
>
>
> It is simple to do calculations by experiment using tapply.  For
> example
>
>
>> with(test, tapply(value, experiment, mean))
>        A         B         C         D         E
> 0.4300000 0.6016667 0.6975000 0.8550000 0.4650000
>
>
>> with(test, tapply(measurement, experiment, max))
> A B C D E
> 4 6 4 2 8
>
>
>
> Mike
>

Mike,
   It's not really that far off track, as I didn't have any R
background when I started this. This is the first time I've used it. I
simply chose a format that I thought would work for me in both Excel
and R. I do like your examples.

   My impression is that reshape coupled with cast can give me more
or less the same format you suggest, although it is a bit of work.
Currently my files save only the start and finish times of the
experiments, and I planned on calculating all the times in between if
necessary. With this format I'd just write the times out on each line
and save that work in R.
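
   Something like the following melt sketch seems to be what would get
me from my current one-experiment-per-row files to your format
(untested, and 'wide', 'start', and the obs* column names are all made
up for illustration):

 require(reshape)

 ## Hypothetical wide, zero-padded layout: one experiment per row,
 ## a couple of id columns, then observation columns obs1..obsN.
 wide <- data.frame(experiment = c('A', 'B'),
                    start      = c(0, 10),
                    obs1       = c(0.27, 0.13),
                    obs2       = c(0.66, 0.65),
                    obs3       = c(0.00, 0.83))

 ## One measurement per row; 'measurement' keeps the old column name.
 long <- melt(wide, id.vars = c('experiment', 'start'),
              variable_name = 'measurement')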

   I suppose the files using this alternative format would be a lot
larger on disk. I currently have 10 values + 500 observations per
experiment, with an average experiment tracking file containing maybe
500-1000 experiments. With this format, in the worst case I suppose
I'd have (10+1) * 1000 values per experiment on disk, but on average
it would be less than that because, as you say, I wouldn't write out
any zeros. Once in memory in R they'd be equivalent. Disk space
doesn't matter, but reading and writing the files might be slower. I
suppose I don't really have to write the zeros out anyway, but at this
point it's just one additional subset after going through reshape.
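
   For example, the extra step would be just one more subset, and the
zeros could be skipped at write time too (again untested, using the
molten 'long' data.frame sketched above):

 long <- subset(long, value != 0)   ## drop the zero padding
 write.table(long, 'test.dat', row.names = FALSE, quote = FALSE)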

   It might be an advantage to get to the subset commands
immediately, but I've still got 10 independent variables, and I
suspect I'm going to be using reshape/cast more than once to get to
my answers, so I haven't been against learning how to work with it.
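
   Going the other way, to get back to a wide one-experiment-per-row
table for a particular question, I gather cast would be just (untested,
same made-up names as above):

 ## One row per experiment, one column per measurement level.
 wide2 <- cast(long, experiment ~ measurement)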

   Overall these are good suggestions and I appreciate them. Thanks!

Cheers,
Mark



