[R] pairs

J C jeanhee.chung at yale.edu
Sat Jan 11 06:33:03 CET 2003

>1) I suggest you try a postscript() device, and convert later if you need
>to.  Expect a very large file size.

Dear Dr. Ripley,

Thank you!  Postscript was able to finish the job (bitmap killed itself).
The file sizes are indeed large: 1.4G, requiring over two hours to
display in gv, but ultimately viewable.  I'm new to manipulating ps files,
but hopefully I can find a fast way to convert the files into a smaller
format. I found an archived message of yours that advised against using
pch="." as a symbol for graphing large datasets, and upon experimentation
I found that the default symbol, pch=21, seemed to produce the smallest
files for some sets of test data when compared with some other symbols.
Running "pch=21, cex=0.35" produced a fairly small point but consumed much
less space than pch=".".  Is this the best solution for producing plot
symbols that take up little room both on the plot and the hard drive?
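
One way to check the symbol question on a given setup is to write the same
test plot to postscript() with different symbols and compare the resulting
file sizes. The sketch below assumes simulated test data and a hypothetical
helper, ps_size(); the sizes it reports will depend on the R version and the
data, so it shows how to measure rather than which symbol wins.

```r
# Sketch: compare PostScript file sizes for a few plotting symbols.
# The data are simulated and ps_size() is a hypothetical helper.
set.seed(1)
n <- 20000                       # illustrative size, far smaller than the real data
x <- rnorm(n)
y <- rnorm(n)

ps_size <- function(pch, cex = 1) {
  f <- tempfile(fileext = ".ps")
  postscript(file = f)           # open a PostScript device writing to a temp file
  plot(x, y, pch = pch, cex = cex)
  invisible(dev.off())           # close the device so the file is complete
  size <- file.info(f)$size      # size in bytes
  unlink(f)
  size
}

sizes <- c(dot      = ps_size("."),
           pch21    = ps_size(21, cex = 0.35),
           pch1     = ps_size(1,  cex = 0.35))
print(sizes)                     # bytes per symbol choice
```

Scaling n up towards the real row count should show whether the ranking of
the symbols holds for larger plots.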

>Sounds like the problem is in your X server and not in R.  I've seen this
>with Xfree (and don't use that myself on Linux).
It's possible; however, I wouldn't know how to fix it from that end.

>2) Don't plot all the points. You say you have a `very large dataset'. In 
>statistics, we give numbers, not vague descriptions. However, with what 
>that means to me (many millions of rows) a scatterplot of a very large 
>dataset is going to be mainly black at least in places. (We've 
>experienced that with 1.4 million points, for example.) That's not a good 
>way to display the data. Either use a density plot, or if you are 
>interested in outliers, thin the centre. We did this by estimating a 
>density phat, then randomly selecting points with probability min(1, 
>const/phat(x)) for a suitable `const'

I have a set of text files, each containing a 450,000 x 41 matrix (18.45
million datapoints) and roughly 300M in size. Indeed, the scatterplots are
overprinted, but I am interested in getting a "feel" for the data before
charging ahead. The data (measurements on artificial phylogenetic trees)
were produced by simulation and although I have been running checks all
along I wanted to make sure that my simulations weren't producing any
strange outliers or oddly shaped distributions. On the other hand, I had no
real guess as to what the data would look like or even what variables would
show strong correlations. Since many of these datapoints are from repeats,
I was in fact able to discern a lot of pattern, rather than getting
all-black plots.

Using both a density plot and a thinned plot may be the way to go, if I
don't find a way to shrink down the graphs.  I hoped that "pairs" would be
a fast, one-line way to take in all my data at once, but of course nothing
has been that easy with all this data. 
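
The density plot and the thinning rule quoted above, keeping each point with
probability min(1, const/phat(x)), can be sketched in base R as follows. This
is only an illustration on simulated data, not the code used in the quoted
analysis: the density estimate phat is a crude count on a regular grid (a
kernel estimate would be smoother), and nbin and const are hypothetical
tuning values.

```r
# Sketch: density display plus density-based thinning of a scatterplot.
# Simulated data stand in for two columns of the real matrix.
set.seed(42)
n <- 50000
x <- rnorm(n)
y <- 0.6 * x + rnorm(n, sd = 0.8)

# Crude density estimate (assumption): point counts on an nbin x nbin grid.
nbin <- 50                                   # hypothetical grid resolution
ix <- cut(x, breaks = nbin, labels = FALSE)  # column index of each point's cell
iy <- cut(y, breaks = nbin, labels = FALSE)  # row index of each point's cell
cell <- (iy - 1L) * nbin + ix                # single index per grid cell
counts <- tabulate(cell, nbins = nbin * nbin)
phat <- counts[cell]                         # count in each point's own cell (>= 1)

# Keep each point with probability min(1, const / phat):
# sparse regions are kept whole, dense regions are thinned.
const <- 5                                   # hypothetical tuning value
keep <- runif(n) < pmin(1, const / phat)

f <- tempfile(fileext = ".ps")
postscript(file = f)
image(matrix(counts, nbin, nbin), main = "binned counts")   # density display
plot(x[keep], y[keep], pch = 21, cex = 0.35, main = "thinned scatterplot")
invisible(dev.off())

cat(sum(keep), "of", n, "points kept\n")
```

Raising const keeps more of the dense centre; lowering it shrinks the file
further while still retaining every point in cells holding at most const
points, which is where outliers would sit.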

