[R] compressing/reducing data for plot

Timo Schneider timos at perlplexity.org
Mon Oct 17 16:31:28 CEST 2011


Hello,

I have simulation results in the form of

  Time                  V                   I                
  0.000000000000e+000  7.218354344368e-001  5.224478627497e-006  
  1.000000000000e-009  7.218354344368e-001  5.224477718002e-006  
  2.000000000000e-009  7.218354344368e-001  5.224477718002e-006  
  4.000108361244e-009  7.218354344368e-001  5.224478627497e-006  
  8.000325083733e-009  7.218354344368e-001  5.224478627497e-006

As the timesteps are small, each simulation produces a lot of data,
about 1e5 data points per simulation.

Now I want to plot this data. If I do this with a simple

plot(x=data$Time, y=data$V, type="l")

the resulting file (I plot into postscript files) is huge and takes a
long time to render, since R creates a new line segment for each
timestep. Of course it makes no sense to plot more than a few hundred
datapoints in a single plot. However, I don't have a good idea how to
remove the "uninteresting" part of the data, i.e., the datapoints that
lie very close to the lines that would be drawn by R anyway if there
were no datapoint for that time value.

Since the values in my simulation are constant most of the time but
sometimes have interesting "spikes", a simple

data <- data[seq(1, nrow(data), by=1000),]

to only plot every 1000th point does not work for me as it could remove
some "spikes" completely or lead to aliasing problems.
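
A small illustration of that failure mode, as a sketch only. The signal
below is made up for the example (the names sim and thinned, the spike
position and the spike height are assumptions, not the simulation data):

```r
# Synthetic signal (assumed for illustration): flat, with one narrow spike
n <- 1e5
sim <- data.frame(Time = seq(0, by = 1e-9, length.out = n),
                  V    = rep(0.72, n))
sim$V[50500:50510] <- 1.5      # spike lies between two sampled rows

# Keep every 1000th row: rows 1, 1001, 2001, ...
idx <- seq(1, nrow(sim), by = 1000)
thinned <- sim[idx, ]

max(sim$V)       # 1.5
max(thinned$V)   # 0.72, the spike is gone
```

The spike covers rows 50500 to 50510, while the nearest kept rows are
50001 and 51001, so the thinned data shows a flat line.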

Is there any standard way to do this in R?


The best thing I came up with so far is a function that judges if a row
in the dataframe should be kept for plotting based on each point's
difference to its predecessor. However, this function has two problems:

* It is very slow! (Takes about 4 seconds for each 1e5 element
dataframe)

* It does not work well if the values increase/decrease monotonically
in small steps - it will remove them all, since the difference between
each point and its predecessor is minimal
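
The second problem can be reproduced with a simplified, single-column
version of the threshold test. This is only a sketch: the curve
1 - exp(-t) and the names t, v, step and keep are assumptions chosen for
illustration, and the "keep the point before" rule is left out.

```r
# Hypothetical slowly rising curve, sampled in 1000 small steps
t <- seq(0, 5, length.out = 1001)
v <- 1 - exp(-t)

# Difference to the predecessor, normalized by the value range
step <- abs(diff(v)) / diff(range(v))

# Threshold test; first and last point are always kept
keep <- c(TRUE, step >= 0.01)
keep[length(keep)] <- TRUE
sum(keep)   # 2

# Error of the straight line drawn between the kept points
line <- approx(t[keep], v[keep], xout = t)$y
max(abs(v - line))
```

No single step reaches the threshold, so only the two endpoints survive
and the plotted line deviates from the curve by close to half of its
value range.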

I included my own function below:

=== cut ===

get_significant_rows_1 <- function (data, threshold) {

# get the difference between each datapoint and the following datapoint
# of course this list is one shorter than the input dataset, which does
# not matter since the first and last datapoint will always be included
diffs = abs(data[1:(nrow(data)-1),] - data[2:nrow(data),]);

# normalize the differences according to the value range in their column
col.range = apply(data,2, function(d) {abs(max(d) - min(d))});
normalized_diffs <- t(apply(diffs, 1, function(d) {d/col.range}));
rm("col.range");
# get the "biggest difference" in each row
biggest_difference <- as.vector(apply(normalized_diffs,1, max));

# check if the "biggest difference" is above the threshold - 
# that means the row is "significant" in a plot
signif <- biggest_difference >= threshold;
rm("biggest_difference");
# the last datapoint/row is always significant, otherwise the plot could become "shorter"
signif[length(signif)] = TRUE;

# also the first one - we are adding a TRUE in front of the signif vector
# now, since it does not include a value for this because the first value
# naturally doesn't have a predecessor, so there was no entry for it in 
# the diffs array
signif <- append(signif, TRUE, 0);

# if a point is significant in a plot, the point before that is also "important",
# at least for line plots, otherwise we get angled lines where flat ones should be
signif <- (signif | append(signif[2:length(signif)], FALSE));

return(data[signif,]);

}

#example application (makes no sense for this kind of data though)

data <- data.frame(a=rnorm(10000), b=rnorm(10000));
#                    dataset, threshold
get_significant_rows_1(data, 0.01)

==== here ====
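
For comparison, a minimal sketch of the same thresholding rule without
the row-wise apply() calls, which are a likely cause of the slowness. It
has not been timed against the function above. The name
get_significant_rows_vec and the toy data frame d are hypothetical, and
the sketch assumes all columns are numeric and none is constant.

```r
get_significant_rows_vec <- function(data, threshold) {
  m <- as.matrix(data)
  n <- nrow(m)

  # absolute difference between each row and the following row
  diffs <- abs(m[-1, , drop = FALSE] - m[-n, , drop = FALSE])

  # normalize each column by its value range
  col.range <- apply(m, 2, function(d) max(d) - min(d))
  normalized <- sweep(diffs, 2, col.range, "/")

  # biggest normalized difference per row, computed column-wise
  biggest <- Reduce(pmax, as.data.frame(normalized))

  # first and last row are always kept
  keep <- c(TRUE, biggest >= threshold)
  keep[n] <- TRUE

  # also keep the row just before each kept row
  keep <- keep | c(keep[-1], FALSE)

  data[keep, ]
}

# toy data: flat signal with a single one-sample spike
d <- data.frame(Time = 0:9, V = c(0, 0, 0, 0, 5, 0, 0, 0, 0, 0))
res <- get_significant_rows_vec(d, 0.5)
res$Time   # 0 3 4 5 8 9
```

The rows around the spike and the two endpoints are kept. The
limitation with slow monotonic changes is unchanged, since the rule
itself is the same.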

Thank you for any helpful advice or comments. :-)

Regards,
Timo


