[Rd] reshape scaling with large numbers of times/rows

Thu Aug 24 14:46:37 CEST 2006

I'd like to thank everyone who has replied so far; more inline:

On Thu, 2006-08-24 at 11:16 +0100, Prof Brian Ripley wrote:
> Your example does not correspond to your description.  You have taken a 
> random number of loci for each subject and measured each a random number 
> of times:

You're right.  I was trying to come up with an example that didn't
require sending out a big hunk of data.  The overall number of
rows/columns and the data types/sizes in the example were true to life
but the relationship between columns was not.  Also, in my testing the
run time of the random example was pretty close to (actually faster
than) the run time on my real data.

In the real data, there's about one row per subject/locus pair (some
combinations are missing).  The genotype data is of character type;
I'd have to think a bit to see whether I could make it into an integer
vector (other than by simply making it a factor, of course).

Thanks to Gabor Grothendieck for demonstrating gl():

> betterTest <- data.frame(subject=as.character(1:70),
+     locus=as.character(gl(4500, 70)),
+     genotype=as.character(as.integer(runif(4500*70, 1, 20))))
> sapply(betterTest, is.factor)
  subject    locus genotype
    TRUE     TRUE     TRUE
> system.time(wideTest <- reshape(betterTest, v.names="genotype",
+     timevar="locus", idvar="subject", direction="wide"), gcFirst=TRUE)
[1] 1356.209  178.867 2071.640    0.000    0.000
> dim(wideTest)
[1]   70 4501
> dim(betterTest)
[1] 315000      3

This was on a different machine (a 2.2 GHz Athlon 64).  The only
difference I can think of between betterTest and my actual data is that
betterTest is ordered.
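
For comparison, here is a minimal sketch of one R-level way to go from
long to wide without reshape(): allocate a subject-by-locus matrix and
fill it through a two-column index built with match().  This is only an
illustration under stated assumptions, not the solution mentioned later
in this thread (that code is not shown here).  The names long, wide,
nSubject and nLocus are hypothetical, and the locus count is cut down so
the example runs quickly.

```r
# Hypothetical stand-in for betterTest, with fewer loci (an assumption).
nSubject <- 70
nLocus <- 450
long <- data.frame(
  subject  = rep(as.character(seq_len(nSubject)), times = nLocus),
  locus    = as.character(gl(nLocus, nSubject)),
  genotype = as.character(as.integer(runif(nSubject * nLocus, 1, 20))),
  stringsAsFactors = FALSE
)

# Row and column labels of the wide result, in order of first appearance.
subjects <- unique(long$subject)
loci <- unique(long$locus)

# Start from all-NA so missing subject/locus combinations stay NA.
wide <- matrix(NA_character_, nrow = length(subjects), ncol = length(loci),
               dimnames = list(subjects, loci))

# Each row of the index matrix is one (row, column) position in wide.
idx <- cbind(match(long$subject, subjects), match(long$locus, loci))
wide[idx] <- long$genotype
```

The result is a character matrix rather than a data frame, so it does
not carry the "genotype.<locus>" column names that reshape() produces;
if a subject/locus pair occurred more than once, the last value would
win.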

> Also, subject and locus are archetypal factors, and forcing them to be 
> character vectors is just making efficiency problems for yourself.

Hmmmm, that's the way they're coming out of the database.  I'm using
RdbiPgSQL from Bioconductor, and I assumed there was a reason why the
database interface wasn't turning things into factors.  Given my (low)
level of R knowledge, I'd have to think for a while to convince myself
that doing so wouldn't make a difference aside from being faster.  Of
course, if you're asserting that that's the case I'll take your word for
it.
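
One way to check the character/factor question on a small scale is
sketched below.  The sample values are made up for illustration; the
point is only that factor() keeps the same strings as levels, stores
integer codes alongside them, and converts back with as.character().

```r
# Hypothetical genotype strings, including a missing value.
genotype <- c("AG", "GG", "AG", "AA", NA, "GG")
genotypeFactor <- factor(genotype)

# Converting back recovers the original strings, NA included.
roundTrip <- identical(as.character(genotypeFactor), genotype)

# A factor is one integer code per element plus a table of levels.
codes <- as.integer(genotypeFactor)
lvls <- levels(genotypeFactor)

# Caveat: for numeric-looking IDs, as.integer() on a factor returns the
# codes, not the numbers; go through as.character() to get the numbers.
subject <- factor(c("10", "2", "33"))
subjectCodes <- as.integer(subject)
subjectValues <- as.integer(as.character(subject))
```

So the conversion loses nothing as long as later code treats the columns
as labels and never relies on as.integer() or as.numeric() of the factor
directly.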

> I have an R-level solution that takes 0.2 s on my machine, and involves no 
> changes to R.
> 
> However, you did not give your affiliation and I do not like giving free 
> consultancy to undisclosed commercial organizations.  Please in future use 
> a proper signature block so that helpers are aware of your provenance.

Ah, I hadn't really thought about this, but I see where you're coming
from.  I work here (my name and this email address are on the page):
http://egcrc.org/pis/white-c.htm
Please forgive my r-devel-newbieness; this is less of an issue on the
other mailing lists I follow.

When there's a chance (however slim, in this case) that something I
write will end up getting used by someone else, I usually use my
personal email address and general identity, because I know it'll follow
me if I change jobs.  The concern, of course, being that someone using
it will want to get in touch with me sometime in the far future.  I
don't exactly have a tenured position.

I really am trying to give at least as much as I'm taking; hopefully my
first email shows that I did a healthy bit of
thinking/reading/googling/coding before posting (maybe too much).
Apparently the C solution isn't necessary, but doing this in 0.2 s is
pretty amazing.  Was that on a data frame of the same size?

Thanks,
Mitch Skinner                            Tel: 510-985-3192
Programmer/Analyst
Ernest Gallo Clinic & Research Center
University of California, San Francisco