[R] approxfun-problems (yleft and yright ignored)

Greg Snow Greg.Snow at imail.org
Thu Aug 26 23:27:27 CEST 2010


OK, I think that I figured out what is going on.  You have some of your x values that are very close to each other in value, but not exactly the same.  If we look at how many unique x values you have we get:

> length(unique(approx.data$x))
[1] 901

But inside approxfun, tapply is used to collapse the duplicated x values, and tapply coerces your x variable to a factor. That coercion goes through character first, and with less precision than unique uses:

> length(levels(as.factor(approx.data$x)))
[1] 893

And this is the number of values that the y vector is reduced to. So when R passes your x and y to the compiled C function, it gets a y vector of length 893 but expects one of length 901. There are therefore 8 positions beyond the end of the vector that it tries to use, but no real data is in those positions; what is there is apparently somewhat random. So when you try to predict near the rightmost extreme of your data (but not beyond it), the results are not predictable.
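
A small illustration of this kind of mismatch on made-up data (not the data set from this thread). It assumes that as.character() on a double keeps at most 15 significant digits, as described in ?as.character; the vector x and the names n_unique and n_char are hypothetical:

```r
# Toy vector: the first two values differ only by one machine epsilon.
x <- c(1, 1 + .Machine$double.eps, 2, 3)

# unique() compares the doubles exactly, so the near-equal pair stays distinct.
n_unique <- length(unique(x))

# Going through character keeps at most 15 significant digits,
# so the near-equal pair collapses to the same string.
n_char <- length(unique(as.character(x)))

# A gap between the two counts flags data that could trigger the problem.
cat("unique:", n_unique, " via character:", n_char, "\n")
```

Running the same two counts on your own x is a quick way to see whether a data set is affected.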

The workaround is to pre-clean your data so that the x values are unique and R does not run into this problem. Another option would be to round the x values to fewer decimal places so that the conversion to factor matches the unique values.
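
A minimal sketch of that workaround, again on made-up data. The choice of 12 significant digits and of mean() for combining tied y values are assumptions for illustration, not something taken from the original data; the names xr, xu and ym are hypothetical:

```r
# Toy data: the first two x values differ only in the 16th digit.
x <- c(1, 1 + .Machine$double.eps, 2, 3)
y <- c(0.9, 0.8, 0.5, 0.1)

# Round x so that values that would collapse as character also collapse as numbers.
# Assumption: 12 significant digits is coarse enough for this purpose.
xr <- signif(x, 12)

# Collapse to one y per rounded x before calling approxfun.
xu <- sort(unique(xr))
ym <- as.vector(tapply(y, xr, mean))

# x and y now have the same length and x has no ties left.
stopifnot(length(xu) == length(ym))
interp <- approxfun(xu, ym, yleft = 1, yright = 0)

# Points left of, inside, and right of range(x).
out <- interp(c(0, 1.5, 4))
print(out)
```

With unique x values passed in, approxfun has no ties left to collapse, so yleft and yright are what come back for inputs outside range(x).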

This should still be reported as a bug, but it could be fixed in the R code, without having to search through the compiled C code.

-- 
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111


> -----Original Message-----
> From: Samuel Wuest [mailto:wuests at tcd.ie]
> Sent: Thursday, August 26, 2010 7:34 AM
> To: Greg Snow
> Cc: r-help at r-project.org
> Subject: Re: [R] approxfun-problems (yleft and yright ignored)
> 
> Hi Greg,
> thanks for the suggestion:
> 
> I have attached some small dataset that can be used to reproduce the
> odd behavior of the approxfun-function.
> 
> If it gets stripped off my email, it can also be downloaded at:
> http://bioinf.gen.tcd.ie/approx.data.Rdata
> 
> Strangely, the problem seems specific to the data structure in my
> expression set, when I use simulated data, everything worked fine.
> 
> Here is some code that I run and resulted in the strange output that I
> have described in my initial post:
> 
> > ### load the data: a list called approx.data
> > load(file="approx.data.Rdata")
> > ### contains the slots "x", "y", "input"
> > names(approx.data)
> [1] "x"     "y"     "input"
> > ### with y ranging between 0 and 1
> > range(approx.data$y)
> [1] 0 1
> > ### compare ranges of x and input-x values (the latter is a small subset of 500 data points):
> > range(approx.data$x)
> [1] 3.098444 7.268812
> > range(approx.data$input)
> [1]  3.329408 13.026700
> >
> >
> > ### generate the interpolation function (warning message benign)
> > interp <- approxfun(approx.data$x, approx.data$y, yleft=1, yright=0, rule=2)
> Warning message:
> In approxfun(approx.data$x, approx.data$y, yleft = 1, yright = 0,  :
>   collapsing to unique 'x' values
> >
> > ### apply to input-values
> > y.out <- sapply(approx.data$input, interp)
> >
> > ### still I find output values >1, even though yleft=1:
> > range(y.out)
> [1] 0.000000 7.207233
> > hist(y.out)
> >
> > ### and the input-data points for which strange interpolation does occur have no unusual distribution (however, they lie close to max(x)):
> > hist(approx.data$input[which(y.out>1)])
> 
> The session info can be found below, thanks a million for any help.
> 
> Sam
> 
> On 25 August 2010 19:31, Greg Snow <Greg.Snow at imail.org> wrote:
> > The plots did not come through, see the posting guide for which
> > attachments are allowed.  It will be easier for us to help if you can
> > send reproducible code (we can copy and paste to run, then examine,
> > edit, etc.).  Try finding a subset of your data for which the problem
> > still occurs, then send the data if possible, or similar simulated data
> > if you cannot send original data.
> >
> > --
> > Gregory (Greg) L. Snow Ph.D.
> > Statistical Data Center
> > Intermountain Healthcare
> > greg.snow at imail.org
> > 801.408.8111
> >
> >
> >> -----Original Message-----
> >> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Samuel Wuest
> >> Sent: Wednesday, August 25, 2010 8:20 AM
> >> To: r-help at r-project.org
> >> Subject: [R] approxfun-problems (yleft and yright ignored)
> >>
> >> Dear all,
> >>
> >> I have run into a problem when running some code implemented in the
> >> Bioconductor panp-package (applied to my own expression data), whereby
> >> gene expression values of known true negative probesets (x) are
> >> interpolated onto present/absent p-values (y) between 0 and 1 using the
> >> approxfun function {stats}; when I used R version 2.8, everything
> >> worked fine, however, after updating to R 2.11.1, I got unexpected
> >> output (explained below).
> >>
> >> Please correct me here, but as far as I understand, the yleft and
> >> yright arguments set the extreme values of the interpolated y-values in
> >> case the input x-values (to which approxfun is applied) fall outside
> >> range(x). So if I run approxfun with yleft=1 and yright=0 with y-values
> >> between 0 and 1, then I should never get any values higher than 1.
> >> However, this is not the case, as this code example illustrates:
> >>
> >> > ### define the x-values used to construct the approxfun, basically these are 2000 expression values ranging from ~ 3 to 7:
> >> > xNeg <- NegExprs[, 1]
> >> > xNeg <- sort(xNeg, decreasing = TRUE)
> >> >
> >> > ### generate 2000 y-values between 0 and 1:
> >> > yNeg <- seq(0, 1, 1/(length(xNeg) - 1))
> >> > ### define yleft and yright as well as the rule to clarify what should happen if input x-values lie outside range(x):
> >> > interp <- approxfun(xNeg, yNeg, yleft = 1, yright = 0, rule=2)
> >> Warning message:
> >> In approxfun(xNeg, yNeg, yleft = 1, yright = 0, rule = 2) :
> >>   collapsing to unique 'x' values
> >> > ### apply the approxfun to expression data that range from ~2.9 to 13.9 and can therefore lie outside range(xNeg):
> >> >  PV <- sapply(AllExprs[, 1], interp)
> >> > range(PV)
> >> [1]    0.000 6208.932
> >> > summary(PV)
> >>      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
> >> 0.000e+00 0.000e+00 2.774e-03 1.299e+00 3.164e-01 6.209e+03
> >>
> >> So the resulting output PV object contains data ranging from 0 to 6208,
> >> the latter of which lies outside yleft and is not anywhere close to
> >> extreme y-values that were used to set up the interp-function. This
> >> seems wrong to me, and from what I understand, yleft and yright are
> >> simply ignored?
> >>
> >> I have attached a few histograms that visualize the data distributions
> >> of the objects xNeg, yNeg, AllExprs[,1] (== input x-values) and PV (the
> >> output), so that it is easier to make sense of the data structures...
> >>
> >> Does anyone have an explanation for this or can tell me how to fix the
> >> problem?
> >>
> >> Thanks a million for any help, best, Sam
> >>
> >> > sessionInfo()
> >> R version 2.11.1 (2010-05-31)
> >> x86_64-apple-darwin9.8.0
> >>
> >> locale:
> >> [1] en_IE.UTF-8/en_IE.UTF-8/C/C/en_IE.UTF-8/en_IE.UTF-8
> >>
> >> attached base packages:
> >> [1] stats     graphics  grDevices utils     datasets  methods   base
> >>
> >> other attached packages:
> >> [1] panp_1.18.0   affy_1.26.1   Biobase_2.8.0
> >>
> >> loaded via a namespace (and not attached):
> >> [1] affyio_1.16.0         preprocessCore_1.10.0
> >>
> >>
> >> --
> >> -----------------------------------------------------
> >> Samuel Wuest
> >> Smurfit Institute of Genetics
> >> Trinity College Dublin
> >> Dublin 2, Ireland
> >> Phone: +353-1-896 2444
> >> Web: http://www.tcd.ie/Genetics/wellmer-2/index.html
> >> Email: wuests at tcd.ie
> >> ------------------------------------------------------
> >
> >
> 
> 
> 
> --
> -----------------------------------------------------
> Samuel Wuest
> Smurfit Institute of Genetics
> Trinity College Dublin
> Dublin 2, Ireland
> Phone: +353-1-896 2444
> Web: http://www.tcd.ie/Genetics/wellmer-2/index.html
> Email: wuests at tcd.ie
> ------------------------------------------------------


