[R] baseline fitters

Tue Feb 20 20:15:16 CET 2007

I am not surprised at the slowness of runquantile, since it is trying to
perform n=4500 partial sorts of k=225 elements. Here are some thoughts
on speeding it up:
1) playing with different endrule settings can save some time, but
usually results in undesirable effects at the first and last 112 values.
All run* functions in caTools use low-level C code for the inner elements
between k/2 and n-k/2. However, the elements at the edges are calculated
using R functions. In the case of runquantile with endrule="func", that
means k calls of the R quantile function. One option for endrule not
available at present would be to pad both sides of the array with k/2
numbers and then use endrule="trim". The trick would be to pick a good
value for the padding number.
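
A minimal sketch of that padding idea, under stated assumptions: the
helper name runquantile_padded is hypothetical (it is not part of
caTools), k is taken to be odd, and the padding value is simply the same
quantile of the first and last k points, which is only one possible
choice.

```r
library(caTools)

# Hypothetical helper, not part of caTools: pad, filter, then trim.
runquantile_padded <- function(x, k, probs = 0.2) {
  n <- length(x)
  h <- k %/% 2                                  # half-window width (k assumed odd)
  # Assumption: pad each side with the quantile of the nearest k points.
  left  <- quantile(x[1:k], probs, names = FALSE)
  right <- quantile(x[(n - k + 1):n], probs, names = FALSE)
  xp <- c(rep(left, h), x, rep(right, h))
  # endrule = "trim" drops h values from each end, so the result has
  # length n and every returned value comes from a full window of size k.
  as.vector(runquantile(xp, k, probs = probs, endrule = "trim"))
}

# Synthetic example data (made up for illustration only).
set.seed(1)
x <- 10 + sin(seq(0, 3, length.out = 4500)) + rexp(4500)
b <- runquantile_padded(x, k = 225, probs = 0.2)
```

Away from the first and last 112 points the result should match plain
runquantile; only the ends depend on the padding value chosen.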

2) you mentioned that you "jimmied something together with runmin and
runmedian". I would try something like runmean with a window of size 5,
15, or 25 and then runmin with window size k. The first one should get
rid of your 'reverse-spikes' and the second would take the running min
of your smoothed function.
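
A rough sketch of suggestion 2), assuming the default endrule settings
are acceptable. The names baseline_mean_min and k_smooth are
hypothetical, the smoothing window of 15 is just one of the sizes
mentioned above, and the data are synthetic.

```r
library(caTools)

# Hypothetical helper: light mean smoothing, then a running minimum.
baseline_mean_min <- function(x, k, k_smooth = 15) {
  xs <- runmean(x, k_smooth)   # damp narrow dropouts ('reverse-spikes')
  runmin(xs, k)                # running minimum of the smoothed signal
}

# Synthetic example data with a few zero-valued dropouts (illustration only).
set.seed(2)
x2 <- 10 + sin(seq(0, 3, length.out = 4500)) + rexp(4500)
x2[sample(length(x2), 20)] <- 0
b2 <- baseline_mean_min(x2, k = 225)
```

Both steps run at O(n), so this should be much cheaper than runquantile;
whether the smoothing window is wide enough to suppress the dropouts has
to be judged on the real data.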

Best,
Jarek Tuszynski

-----Original Message-----
From: Thaden, John J [mailto:ThadenJohnJ at uams.edu] 
Sent: Tuesday, February 20, 2007 1:23 PM
To: r-help at stat.math.ethz.ch
Cc: Tuszynski, Jaroslaw W.
Subject: baseline fitters

I am pretty pleased with baselines I fit to chromatograms using the
runquantile() function in caTools(v1.6) when its probs parameter is 
set to 0.2 and its k parameter to ~1/20th of n (e.g., k ~ 225 for n ~ 
4500, where n is time series length).  This ignores occasional low-
side outliers, and, after baseline subtraction, I can re-adjust any
negative values to zero.

But runquantile's computation time proves exceedingly long for my large
datasets, particularly if I set the endrule parameter to 'func'.  Here is
what caTools author Jarek Tuszynski says about relative speeds of various
running-window functions:

   - runmin, runmax, runmean run at O(n) 
   - runmean(..., alg="exact") can have worst case speed of O(n^2) for 
     some small data vectors, but average case is still close to O(n). 
   - runquantile and runmad run at O(n*k) 
   - runmed - related R function runs at O(n*log(k))

The obvious alternative runmin() performs poorly due to dropout (zero-
and low-value 'reverse-spikes') in the data. And runmed fits a baseline
that, upon subtraction, obviously will send half the values into the
negative, not suitable for my application. I jimmied something together
with runmin and runmedian that is considerably faster; unfortunately,
the fit seems less good, at least by eye, due still to the bad runmin
behavior.

I'd be interested in other baseline fitting suggestions implemented
or implementable in R (I'm using v. 2.4.1pat under WinXP).  Why, for
instance, can I not find a running trimmean function? Because it 
offers no computational savings over runquantile?

Also, the 'func' setting for the caTools endrule parameter--which adjusts
the value of k as ends are approached--is said not to be optimized (using
this option does add further time to computations).  Is there an
alternative that would be speedier, e.g., setting endrule = "keep" and
then subsequently treating ends with R's smoothEnds() function?

-John Thaden
Little Rock, Arkansas USA
