%% $Id: np_faq.tex,v 1.99 2018/03/16 14:22:41 jracine Exp jracine $ %\VignetteIndexEntry{Frequently Asked Questions (np)} %\VignetteDepends{np} %\VignetteKeywords{nonparametric, kernel, categorical} %\VignettePackage{np} \documentclass[12pt]{amsart} \tolerance=5000 \usepackage{setspace,hyperref} \newcommand{\field}[1]{\mathbb{#1}} \newcommand{\R}{\field{R}} %% Change the default page sizes. \setlength{\topmargin}{-0.25in} \setlength{\textheight}{8.5in} \setlength{\oddsidemargin}{.0in} \setlength{\evensidemargin}{.0in} \setlength{\textwidth}{6.5in} \setlength{\footskip}{.5in} \title{Package \texttt{np} FAQ} \date{\today} \author{Jeffrey S.~Racine} \begin{document} \maketitle \tableofcontents \onehalfspacing \section{Overview and Current Version} This set of frequently asked questions is intended to help users who are encountering unexpected or undesired behavior when trying to use the \texttt{np} package. Many of the underlying C routines have been extensively tested over the past two decades. However, the \texttt{R} 'hooks' needed to call these routines along with processing of the data required for a seamless user experience may produce unexpected or undesired results in some settings. If you encounter any issues with the \texttt{np} package, kindly first ensure that you have the most recent version of \texttt{np}, \texttt{R}, and \texttt{RStudio} (if appropriate) installed. Sometimes issues encountered using outdated versions of software have been resolved in the current versions, so this is the first thing one ought to investigate when the unexpected occurs. Having ensured that the problem persists with the most recently available versions of all software involved, kindly report any issues you encounter to me, and \textsl{please include} your code, data, version of the \texttt{np} package, version of \texttt{R}, and operating system (with version number) used so that I can help track down any such issues (\href{mailto:racinej@mcmaster.ca}{racinej@mcmaster.ca}). And, of course, if you encounter an issue that you think might be of interest to others, kindly email me the relevant information and I will incorporate it into this FAQ. This FAQ refers to the most recent version, which as of this writing is 0.60-18. Kindly update your version should you not be using the most current (from within R, \texttt{update.packages()} ought to do it, though also see \ref{update} below.). See the appendix in this file for cumulative changes between this and previous versions of the \texttt{np} package. \section{Frequently Asked Questions} \subsection{How do I cite the \texttt{np} package?} Once you have installed the \texttt{np} package (\texttt{install.packages("np")}), if you load the \texttt{np} package (\texttt{library("np")}) and type \texttt{citation("np")} you will be presented with the following information. \begin{singlespacing} \begin{verbatim} > citation("np") To cite np in publications use: Tristen Hayfield and Jeffrey S. Racine (2008). Nonparametric Econometrics: The np Package. Journal of Statistical Software 27(5). DOI 10.18637/jss.v027.i05 A BibTeX entry for LaTeX users is @Article{, title = {Nonparametric Econometrics: The np Package}, author = {Tristen Hayfield and Jeffrey S. Racine}, journal = {Journal of Statistical Software}, year = {2008}, volume = {27}, number = {5}, url = {https://www.jstatsoft.org/index.php/jss/article/view/v027i05}, doi = {10.18637/jss.v027.i05}, pages = {1-32}, } \end{verbatim} \end{singlespacing} \subsection{I have never used \texttt{R} before. 
Can you direct me to some introductory material that will guide me through the basics?} There are many excellent introductions to the \texttt{R} environment, with more on the way. First, I would recommend going directly to the \texttt{R} website (\url{http://www.r-project.org}) and looking under Documentation/Manuals (\url{http://cran.r-project.org/manuals.html}), where you will discover a wealth of documentation for \texttt{R} users of all levels. See also the \texttt{R} task views summary page (\url{http://cran.r-project.org/web/views}) for information grouped by field of interest. A few documents that I mention to my students, and which are tailored to econometricians, include \url{http://cran.r-project.org/doc/contrib/Verzani-SimpleR.pdf}, Cribari-Neto \& Zarkos (1999) \cite{CRIBARI_NETO_ZARKOS:1999}, Racine \& Hyndman (2002) \cite{RACINE_HYNDMAN:2002}, and Farnsworth (2006) \cite{FARNSWORTH:2006}, to name but a few. Those looking for exemplar data sets outside of those contained in the \texttt{np} package are directed to the \texttt{Ecdat} \cite{Ecdat} and \texttt{AER} \cite{AER} packages. I maintain a 'Gallery' that provides a forum for users to share code and discover examples and illustrations, which can be found at \url{http://socserv.mcmaster.ca/racinej/Gallery/Home.html}.

Often the best resource is right down the hall. Ask a colleague whether they use or know anyone who uses R, then offer to buy that person a coffee and along the way drop something like ``I keep hearing about the \texttt{R} project\dots I feel like such a Luddite\dots''

\subsection{\label{update}How do I keep all \texttt{R} packages on my system current?}

Run the command \texttt{update.packages(checkBuilt=TRUE,ask=FALSE)}, which will not only update all packages that are no longer current, but will also update all packages built under outdated installed versions of R, if appropriate.

\subsection{It seems that there are a lot of packages that must be installed in order to conduct econometric analysis (tseries, lmtest, np, etc.). Is there a way to avoid having to install each package individually?}

Certainly. The Comprehensive R Archive Network (CRAN) is a network of ftp and web servers around the world that store identical, up-to-date versions of code and documentation for \texttt{R}. The CRAN 'task view' for computational econometrics might be of particular interest to econometricians. The econometric task view provides an excellent summary of both parametric and nonparametric econometric packages that exist for the R environment and provides one-stop installation for these packages. See \href{http://cran.r-project.org/web/views/Econometrics.html}{cran.r-project.org/web/views/Econometrics.html} for further information.

To automatically install a task view, the \texttt{ctv} package first needs to be installed and loaded, i.e.,
\begin{verbatim}
install.packages("ctv")
library("ctv")
\end{verbatim}
The econometric task view can then be installed via \texttt{install.views()} and updated via \texttt{update.views()} (which first assesses which of the packages are already installed and up-to-date), i.e.,
\begin{verbatim}
install.views("Econometrics")
\end{verbatim}
or
\begin{verbatim}
update.views("Econometrics")
\end{verbatim}

\subsection{Is there a 'gentle guide' to the \texttt{np} package that contains some easy-to-follow examples?}

Perhaps the most gentle introduction is contained in the \texttt{np} package itself in the form of a 'vignette'.
To view the vignette run R, install the \texttt{np} package (\texttt{install.packages("np")}), then type \texttt{vignette("np",package="np")} to view or print the vignette (this vignette is essentially the article that appeared in the Journal of Statistical Software that describes the \texttt{np} package (Hayfield \& Racine \cite{HAYFIELD_RACINE:2008})). See also \texttt{vignette("entropy\_np",package="np")} for a vignette on the entropy-based functions and procedures introduced to the \texttt{np} package in versions 0.30-4 through 0.30-8. In addition, you might be interested in the nonparametric econometrics primer (Racine \cite{RACINE:2008}) which is available for download from my website as is the code that will generate the examples contained therein. For a listing of all routines in the \texttt{np} package type: \texttt{library(help="np")}. \subsection{I noticed you have placed a new version of the \texttt{np} package on CRAN. How can I determine what has been changed, modified, fixed etc?} See the CHANGELOG on the CRAN site (\url{http://cran.r-project.org/web/packages/np/ChangeLog}), or go to the end of this document where the CHANGELOG is provided for your convenience. \subsection{What is the difference between the \texttt{np} package and the previous stand-alone programs you wrote, namely, N \copyright, NPREG \copyright, and NPDEN \copyright?} The \texttt{np} package is built from the same C library that underlies its predecessors N \copyright, NPREG \copyright, and NPDEN \copyright. In fact, \texttt{R} calls the compiled C code that underlies its predecessors (one of the beauties of \texttt{R} is that you can obtain the benefits of compiled code (i.e., speed) yet have access to the rich superset of \texttt{R} routines and \texttt{R} packages built by others). Therefore, there is no penalty in run-time when using R versus the stand alone precompiled binary programs N \copyright, NPREG \copyright, and NPDEN \copyright\, (unless of course the compiler or compiler flags differ from those used to build its predecessors). \subsection{How can I read data stored in various formats such as Stata, SAS, Minitab, SPSS etc.~into the R program?} Install the foreign library via \texttt{install.packages("foreign")} then do something like \begin{singlespacing} \begin{verbatim} mydat <- read.dta("datafile.dta"), \end{verbatim} \end{singlespacing} where \texttt{datafile.dta} is the name of your Stata data file. Note that, as of version 0.8-34, the foreign package function \texttt{read.dta} supports reading files directly over the Internet making for more portable code. For instance, one could do something like \begin{verbatim} mydat <- read.dta(file="http://www.principlesofeconometrics.com/stata/mroz.dta") \end{verbatim} as one could always do with, say, \texttt{read.table()}. \subsection{I want to use so-and-so's semiparametric/nonparametric method, however, the \texttt{np} package does not include this particular method\dots} This is why we have included the function \texttt{npksum()}, which exists so that you can create your own kernel objects and take advantage of underlying kernel methods implemented in the \texttt{np} package without having to write, say, C or Fortran code. With the options available, you could create new nonparametric tests or even new kernel estimators. For instance, the convolution kernel option would allow you to replicate, say, the least squares cross-validation function for kernel density estimation found in \texttt{npudensbw()}. 
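By way of illustration, the following minimal sketch (illustrative only; the helper \texttt{cv.ls}, the use of \texttt{optimize()}, and the search interval are our own choices rather than anything supplied by the package) assembles the least squares cross-validation objective for a univariate kernel density estimator from \texttt{npksum()} calls, using the convolution kernel for the integrated squared density term and leave-one-out sums for the remaining term, and then compares the resulting bandwidth with that delivered by \texttt{npudensbw()} (the sketch assumes the default second order Gaussian kernel throughout).
\begin{verbatim}
library(np)
set.seed(42)
x <- rnorm(250)
## CV(h) = int fhat(x)^2 dx - (2/n) sum_i fhat_{-i}(x_i)
cv.ls <- function(h,x) {
  n <- length(x)
  ## Integrated squared density via the convolution kernel:
  ## (1/(n^2 h)) sum_i sum_j Kbar((x_i-x_j)/h)
  term1 <- sum(npksum(txdat=x,bws=h,operator="convolution")$ksum)/(n^2*h)
  ## Leave-one-out term: 2/(n(n-1)h) sum_i sum_{j!=i} K((x_i-x_j)/h)
  term2 <- 2*sum(npksum(txdat=x,bws=h,leave.one.out=TRUE)$ksum)/(n*(n-1)*h)
  term1 - term2
}
## Minimize the objective over an (illustrative) range of bandwidths and
## compare with the bandwidth delivered by npudensbw()
h.cv <- optimize(cv.ls,interval=c(0.01,2)*sd(x),x=x)$minimum
h.np <- npudensbw(~x,bwmethod="cv.ls")$bw
c(h.cv,h.np)
\end{verbatim}
The two bandwidths ought to be in close agreement, with any small differences reflecting the different numerical optimizers involved.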
The function \texttt{npksum()} uses highly-optimized C code that strives to minimize its memory footprint, while there is low overhead involved when using repeated calls to this function. See, by way of illustration, the example in the \texttt{npksum()} help file that conducts leave-one-out cross-validation for a local constant regression estimator via calls to the \texttt{R} function \texttt{nlm()}, and compares this to the \texttt{npregbw()} function. If you wish to have a method incorporated into a future version of the \texttt{np} package, the best way to achieve this is to successfully code up the method using \texttt{npksum()}, briefly document it and write up an example, then send it to us. We will then, at our discretion, do our best to adapt and incorporate this into a future version and of course give credit where credit is due. \subsection{Cross-validation takes forever, and I can't wait that long\dots} This is the most common complaint from frustrated users coming to terms with numerically demanding statistical methods. I am fond of saying 'if you want the wrong answer, I can give it to you right away', but this wears thin quickly. \begin{enumerate} \item Some background may be in order. Suppose, by way of illustration, that you are conducting nonparametric regression (local constant or local linear) for a \textsl{given} bandwidth vector (nonparametric kernel regression is simply locally weighted least squares estimation). Let $g(x)$ denote the true conditional mean at some point $x$, and $\hat g(x)$ some estimate. To solve this problem requires computing weighted least squares not once (as would be the case for parametric regression), but $n$ times (the weights are local to the point at which you construct the regression model, $x_i$, $i=1,\dots,n$). That is, to compute $\hat g(x_i)$, $i=1,\dots,n$ requires an order of magnitude ($O(n)$) times more computation than that for parametric regression. So, for a \textsl{given} bandwidth, we require $O(n)$ more computation than parametric regression. But to determine the appropriate bandwidth vector, we minimize, say, a cross-validation function numerically, which requires a large number of evaluations of some objective function, say the delete-one least squares criterion function $CV=n^{-1}\sum_{i=1}^n(y_i-\hat g_{-i}(x_i))^2$ where $\hat g_{-i}(x_i)$ is the leave-one-out estimator of $g(x_i)$. This takes a problem that is already an order of magnitude more computationally involved than parametric regression and again increases the computational burden by potentially another order of magnitude. It is evident that as $n$ increases the computational burden associated with optimal bandwidth selection will increase exponentially. In general, cross-validation methods have run times that are proportional to the square of the number of observations (of computational order $n^2$ hence a doubling of the sample size will increase run time by a factor of four). The solution I currently favor is to run the code in a parallel computing environment. 
The underlying C code for \texttt{np} is MPI-aware (MPI denotes the 'message passing interface', a popular parallel programming library that is an international standard), and a version of the \texttt{np} package titled '\texttt{npRmpi}' exists for running jobs in a parallel environment that leverages the \texttt{Rmpi} package (\cite{Rmpi}).\footnote{We are extremely grateful to Hao Yu for providing the MPI functionality contained in the \texttt{Rmpi} package.} With respect to the \texttt{npRmpi} package, kindly note that I cannot assist with issues surrounding installation and setup due to the vast heterogeneity of MPI implementations and methods for executing such programs. You are instead strongly advised to seek local advice from your sysadmin or others. The document \texttt{npRmpi.pdf} may contain some useful information for such issues. You might also want to first get the \texttt{Rmpi} package installed and running properly as once you are at this stage it ought to then be trivial to install the \texttt{npRmpi} package. Also note that running programs in a parallel environment requires minor differences in how \texttt{np} functions are called, and you would be wise to examine the examples in the demo directory (download and unpack the source code and look in this directory) and the overview file \texttt{npRmpi.pdf} in the inst/doc/ directory of the \texttt{npRmpi} package you downloaded and unpacked. \item Alternatively, you can use the method outlined in Racine (1993) \cite{RACINE:1993}. The method is based on the fact that the unknown constant $c_j$ (the 'scale factor') in the formula $c_j\sigma_j n^{-1/(2p+r)}$ is independent of the sample size, so one can conduct bandwidth selection on random subsets and do this for a large number of subsets then take the mean/median over these subsets and feed the scale factor into the final routine for the entire sample. Below you will find simple \texttt{R} code that replicates the method using numerical search and resampling without replacement rather than the grid method outlined in \cite{RACINE:1993} (both have equivalent properties but this is perhaps simpler to implement using the \texttt{np} package). 
\begin{singlespacing} \begin{verbatim} ## Regression example ## Generate a moderately large data set set.seed(12345) n <- 100000 x1 <- runif(n) x2 <- runif(n) y <- 1 + x1 + sin(pi*x2) + rnorm(n,sd=.1) ## Set the number of resamples and the subsample size num.res <- 50 n.sub <- 250 ## Create a storage matrix bw.mat <- matrix(NA,nrow=num.res,ncol=2) ## Get the scale factors for resamples from the full sample of size n.sub options(np.messages=FALSE) for(i in 1:num.res) { cat(paste(" Replication", i, "of", num.res, "...\r")) bw.mat[i,] <- npregbw(y~x1+x2,regtype="ll", subset=sample(n,n.sub))$sfactor$x } ## A function to compute the median of the columns of a matrix colMedians <- function(data) { colmed <- numeric(ncol(data)) for(i in 1:ncol(data)) { colmed[i] <- median(data[,i]) } return(colmed) } ## Take the median scale factors bw <- colMedians(bw.mat) ## The final model for the full dataset model.res <- npreg(y~x1+x2,bws=bw,regtype="ll",bwscaling=TRUE) \end{verbatim} \end{singlespacing} \begin{singlespacing} \begin{verbatim} ## Hat tip to Yoshi Fujiwara for this ## nice example n <- 100000 library(MASS) rho <- 0.25 Sigma <- matrix(c(1,rho,rho,1),2,2) mydat <- mvrnorm(n=n, rep(0, 2), Sigma) x <- mydat[,1] y <- mydat[,2] rm(mydat) num.res <- 100 n.sub <- 100 bw.mat <- matrix(NA,nrow=num.res,ncol=2) options(np.messages=FALSE) for(i in 1:num.res) { bw <- npcdensbw(y ~ x,subset=sample(n,n.sub)) bw.mat[i,] <- c(bw$sfactor$y,bw$sfactor$x) } colMedians <- function(data) { colmed <- numeric(ncol(data)) for(i in 1:ncol(data)) { colmed[i] <- median(data[,i]) } return(colmed) } bw <- colMedians(bw.mat) bw <- npcdensbw(y ~ x, bws=bw, bwscaling=TRUE, bandwidth.compute=FALSE) summary(bw) plot(bw,xtrim=.01) \end{verbatim} \end{singlespacing} \item Barring this, you can set the search tolerances to be a bit less terse (at the expense of potential accuracy, i.e., becoming trapped in local minima) by setting, say, \texttt{tol=0.1} and \texttt{ftol=0.1} in the respective bandwidth routine (see the docs for examples). Also, you can set \texttt{nmulti=1} which overrides the number of times the search procedure restarts from different random starting values (the default is to restart $k$ times where $k$ is the number of variables). Be warned, however, that this is {\em definitely not recommended} and should {\em be avoided at all costs} for all but the most casual examination of a relationship. One ought to use multistarting for any final results and never override default search tolerances {\em unless increasing multistarts beyond the default}. Results based upon exhaustive search {\em often differ dramatically} from that based on limited search achieved by overriding default search tolerances. \item For those who like to tinker and who work on a *NIX system with the gcc compiler suite, you can change the default compiler switches used for building \texttt{R} packages which may generate some modest improvements in run time. The default compiler switches are \begin{singlespacing} \begin{verbatim} -g -O2 \end{verbatim} \end{singlespacing} and are set in the R-*.*.*/etc/Makeconf file (where *.*.* refers to your \texttt{R} version number, e.g. R-2.10.1). You can edit this file and change these defaults to \begin{singlespacing} \begin{verbatim} -O3 -ffast-math -fexpensive-optimizations -fomit-frame-pointer \end{verbatim} \end{singlespacing} then reinstall the \texttt{np} package and you may experience some improvements in run time. 
Note that the -g flag turns on production of debugging information which can involve some overhead, so we are disabling this feature. This is not a feature used by the typical applied researcher but if you envision requiring this it is clearly trivial to re-enable debugging. I typically experience in the neighborhood of a 0-5\% reduction in run time for data-driven bandwidth selection on a variety of systems depending on the method being used, though mileage will of course vary. \item Finally, I have recently been working on extending regression splines (as opposed to smoothing splines\footnote{Unlike regression splines, a smoothing spline places a knot at each data point and penalizes the fit for lack of smoothness as defined by the second derivative (typically cubic splines are used). When the penalty is zero this produces a function that interpolates the data points. When the penalty is infinite, this delivers the linear OLS fit to the data.}) to admit categorical predictors (see references for papers with Yang and Ma in the \texttt{crs} package). Because regression splines involve simple (i.e., 'global' as opposed to local) least-squares fits, they scale much better with respect to the sample size so for large sample sizes you may wish to instead consider this approach (kindly see my webpage for further details). You can run cross-validation with hundreds of thousands of observations on a decent desktop in far less time than that required for cross-validated kernel estimates. Be forewarned, however, that the presence of categorical predictors can lead to increased execution time unless the option \texttt{kernel=FALSE} is invoked (the default is to use local kernel weighting in the presence of categorical predictors). \end{enumerate} \subsection{Is there a way to figure out roughly how long cross-validation will take on a large sample?} Certainly. You can run cross-validation on subsets of your data of increasing size and time each run, then estimate a double-log model of sample size on run time (run time can be approximated by a linear log-log model) and then use this to predict approximate run-time. The following example demonstrates this for a simple model, but you can modify it trivially for your data. Note that the larger is \texttt{n.max} the more accurate it will likely be. Note that we presume your data is in no particular order (if it is, you perhaps ought to shuffle it first). We plot the log-log model fit and prediction along with that expressed in hours. \begin{singlespacing} \begin{verbatim} ## Set the upper bound (n.max > 100) for the sub-samples on which you ## will run cross-validation (perhaps n.max = 1000 (or 2000) ought to ## suffice). 
## For your application, n will be your sample size
n <- 2000
n.max <- 1000
x <- runif(n)
y <- 1 + x + rnorm(n)
n.seq <- seq(100,n.max,by=100)
time.seq <- numeric(length(n.seq))
for(i in 1:length(n.seq)) {
  time.seq[i] <- system.time(npregbw(y~x,subset=seq(1:n.seq[i])))[3]
}
## Now fit a double-log model and generate/plot actual values plus
## prediction for n (i.e., approximate run time in hours)
log.time <- log(time.seq)
log.n <- log(n.seq)
model <- lm(log.time~log.n)
n.seq.aug <- c(n.seq,n)
time.fcst <- exp(predict(model,newdata=data.frame(log.n=log(n.seq.aug))))
par(mfrow=c(2,1))
plot(log(n.seq.aug),log(time.fcst),type="b",
     xlab="log(Sample Size)",
     ylab="log(Run Time)",
     main="Approximate Run Time (log seconds)")
plot(n.seq.aug,time.fcst/3600,type="b",
     xlab="Sample Size (n)",
     ylab="Hours",
     main="Approximate Run Time (hours)",
     sub=paste("Predicted run time for n =", n, "observations:",
               signif(time.fcst[length(time.fcst)]/3600, digits=2),
               "hours"))
\end{verbatim}
\end{singlespacing}

\subsection{Where can I get some examples of R code for the np package in addition to the examples in the help files?}

Start \texttt{R} then type \verb+demo(package="np")+ and you will be presented with a list of demos for constrained estimation, inference, and so forth. To run one of these demos type, for example, \verb+demo(npregiv)+ (note that you must first install the \texttt{rgl} package to run this particular demo). To find the location of a demo type, for example, \verb+system.file("demo","npregiv.R",package="np")+, then you can take the source code for this demo and modify it for your particular application. You can also find examples and illustrations at the 'Gallery' located at \url{http://socserv.mcmaster.ca/racinej/Gallery/Home.html}.

\subsection{I need to apply a boundary correction for a univariate density estimate - how can I do this?}

As of version 0.60-5, you can use either \texttt{npuniden.boundary} (boundary kernel functions) or \texttt{npuniden.reflect} (data reflection).

\subsection{I notice your density estimator supports manual bandwidths, the normal-reference rule-of-thumb, likelihood cross-validation and least squares cross-validation, but not plug-in rules. How can I use various plug-in rules for estimating a univariate density?}

For the univariate case this is straightforward, as the default R installation supports a variety of univariate plug-in bandwidths (see \texttt{?bw.nrd} for details). For example, \texttt{bw.SJ} computes the bandwidths outlined in
\begin{verbatim}
Sheather, S. J. and Jones, M. C. (1991) A reliable data-based
bandwidth selection method for kernel density estimation.
_Journal of the Royal Statistical Society series B_, *53*, 683-690.
\end{verbatim}
Incorporating these univariate plug-in bandwidth selectors into the univariate density estimation routines in \texttt{np} is straightforward, as the following code snippet demonstrates, and they will certainly be faster than the likelihood and least squares cross-validation approaches. Also, you may not wish to compute the density estimate for all sample realizations, but rather for a shorter grid. The following example demonstrates both and is closer in spirit to the \texttt{density()} function in base R.
\begin{verbatim}
x <- rnorm(10000)
xeval <- seq(min(x),max(x),length=100)
f <- npudens(tdat=x,edat=xeval,bws=bw.SJ(x))
\end{verbatim}
You might even be tempted to use these in the multivariate case via \texttt{bws=c(bw.SJ(x1),bw.SJ(x2),...)}, though these would be optimal for univariate densities and most certainly not for a multivariate density. However, for exploratory purposes these may be of interest to some users.

\subsection{\label{code_nowork}I wrote a program using \texttt{np} and it does not work as I expected\dots}

There exists a rather extensive set of examples contained in the docs. You can run these examples by typing \texttt{example("npfunctionname")}, where npfunctionname is, say, \texttt{npreg}, as in \texttt{example("npreg")}.\footnote{For a listing of all routines in the np package type: \texttt{library(help="np")}.} These examples all pass quality control and produce the expected results, so first see whether your problem fits into an existing example, and if not, carefully follow the examples listed in a given function's help page for syntax issues etc. If you are convinced that the problem lies with \texttt{np} (there certainly will be undiscovered 'features', i.e., bugs), then kindly send me your code and data so that I can replicate and help resolve the issue.

\subsection{Under Mac OS X, when I run a command no progress is displayed\dots}

This should no longer occur for np versions 0.30-0 and up. For previous versions, this reflected a peculiarity of console input/output (I/O) under Mac OS X. Note, however, that if you run \texttt{R} in a terminal rather than \texttt{Rgui} you will get the full *NIX\footnote{*NIX is often used to describe UNIX and other UNIX-like platforms (i.e., UNIX, BSD, and GNU/Linux distributions). I harbour strong preferences for *NIX computing platforms.} experience. But also note that there is a platform-independent interface that does not suffer from this limitation called 'RStudio' that you might prefer to existing interfaces (\url{http://www.rstudio.org}).

\subsection{When some routines are running under MS Windows, \texttt{R} appears to be 'not responding.' It appears that the program is not 'hung', rather it is simply computing. The previous stand-alone program (N \copyright) always displayed useful information\dots}

This should no longer occur for np versions 0.30-0 and up. For previous versions, this reflected a peculiarity of the R Windows GUI, and was not specific to the \texttt{np} package. From the \texttt{R} Windows FAQ\dots ``When using Rgui the output to the console seems to be delayed. This is deliberate: the console output is buffered and re-written in chunks to be faster and less distracting. You can turn buffering off or on from the 'Misc' menu or the right-click menu: \verb+<Ctrl-W>+ toggles the setting.''

\subsection{Some results take a while, and my MS Windows computer is sluggish while \texttt{R} is running\dots}

You can easily change the priority of your \texttt{R} job on the fly, just as you might under *NIX. Pull up the task manager (\verb+<Ctrl>-<Alt>-<Del>+), go to the process list, and find the process \texttt{Rgui.exe} (or \texttt{R.exe} if you are running \texttt{Rterm}), select this process by left clicking on it, then right clicking will bring up a menu, select \texttt{Set priority}, then change priority to \texttt{low} and hit $<$\texttt{ok}$>$. For lengthy jobs this will make your life much smoother, and you can, say, run multiple jobs in low priority with no sluggishness whatsoever for your other applications (useful for queuing a number of long jobs).
Alternatively, you could permanently change the default priority of \texttt{R} under MS Windows by modifying the properties of your \texttt{R} desktop icon. \subsection{\label{factor}A variable must be cast as, say, a factor in order for \texttt{np} to recognize this as an unordered factor. How can I determine whether my data is already cast as a factor?} Use the \texttt{class()} function. For example, define \texttt{x <- factor(c("male","female"))}, then type \texttt{class(x)}. \subsection{When I use \texttt{plot()} \texttt{(npplot())} existing graphics windows are overwritten. How can I display multiple graphics plots in separate graphics windows so that I can see results from previous runs and compare that to my new run?} Use the \texttt{dev.new()} command in between each call to \texttt{plot()}. This will leave the existing graphics window open and start a new one. The command \texttt{dev.list()} will list all graphics windows, and the command \texttt{dev.set(integer.foo)} will allow you to switch from one to another and overwrite existing graphics windows should you so choose. Alternately, you can use RStudio (\url{http://www.rstudio.org}) where the plot window is such that previous plots can be recalled on the fly, resized, saved in various formats etc. \subsection{\label{plot names}Sometimes \texttt{plot()} fails to use my variable names\dots} This should not occur unless you are using the data frame method and not naming your variables (e.g., you are doing something like data.frame(ordered(year))). To correct this, name your variables in the respective data frame, as in \texttt{data <- data.frame(year=ordered(year),gdp)} \noindent so that the ordered factor appears as 'year' and not 'ordered.year' \subsection{Sometimes \texttt{plot()} appends my variable names with .ordered or .factor\dots} See also \ref{plot names} above. \subsection{I specify a variable as \texttt{factor()} or \texttt{ordered()} in a data frame, then call this when I conduct bandwidth selection. However, when I try to plot the resulting object, it complains with the following message:} \texttt{Error in eval(expr, envir, enclos) : object ``variable'' not found\dots} This arises because \texttt{plot()} (\texttt{npplot()}) tries to retrieve the variable from the environment but you have changed the definition when you called the bandwidth selection routine (e.g., \texttt{npregbw(y\textasciitilde x,data=dataframe)}). To correct this, simply call \texttt{plot()} with the argument \texttt{data=dataframe} where \texttt{dataframe} is the name of your data frame. \subsection{My \texttt{np} code produces errors when I attempt to run it\dots} First, it is good practice to name all arguments (see the docs for examples) as in \texttt{npregbw(formula=y\textasciitilde x)} (i.e., explicitly call formula for functions that use named formulas). This will help the code return a potentially helpful error message. Next, follow the examples listed at the end of each function help page closely (i.e., \texttt{?npreg} then scroll down to \texttt{Examples:}). See also \ref{code_nowork} above. \subsection{\label{0/1}I have (numeric) 0/1 dummy variable regressors in my parametric model. Can I just pass them to the \texttt{np} functions as I would in a parametric setting?} In general, definitely not -- you need to correctly classify each variable as type \texttt{factor} and treat it as one variable only. 
By way of example, suppose in your data you have created dummy variables for year, for example, dummy06 which equals 1 for 2006, 0 otherwise, dummy07 which equals 1 for 2007, 0 otherwise etc. We create these by habit for parametric models. But, the underlying variable is simply year, which equals 2006, 2007, and so forth. In \texttt{np} (and \texttt{R} in general), you get to economize by just telling the function that the variable 'year' is ordered, as in ordered(year), where year is a vector containing elements 2006, 2007 etc. Of course, seasoned R users would appreciate that this is in fact the simple way to do it with a parametric model as well. You would {\em never}, therefore, just pass dummy variables to an \texttt{np} function as you would for linear parametric models. The {\em only} exception is where you have only one 0/1 dummy for one variable, say '\texttt{sex}', and in this case you still would have to enter this as \texttt{factor(sex)} so that the \texttt{np} function recognizes this as a factor (otherwise it would treat it as continuous and use a kernel function that is inappropriate for a factor). \subsection{\label{terminates}I have a categorical variable, ignored the advice in items \ref{factor} and \ref{0/1}, and R terminates with the following error.} \begin{verbatim} ** Fatal Error in routine kernel_bandwidth() ** variable 0 appears to be constant! ** Program terminated abnormally! \end{verbatim} Presuming your variable is not in fact a constant (i.e., is not in fact a 'variable'), this can only occur in the case where a variable is 'pathological' in that it has \texttt{IQR=0} but \texttt{std>0} (see the section titled 'Changes from Version 0.30-1 to 0.30-2 [19-Apr-2009]' for further details). This should no longer occur for np versions 0.30-2 and up. \subsection{Rgui appears to crash. There must be a bug in the program.} Try running your code using the terminal (i.e., \texttt{Rterm} in Windows or \texttt{R} in a Mac OS X terminal) and see whether you get the message in item \ref{terminates}. \subsection{Can I skip creating a bandwidth object and enter a bandwidth directly?} Certainly, though I would advise doing so for exploratory data analysis only. For example, attach a dataset via \begin{singlespacing} \begin{verbatim} data(cps71) attach(cps71) \end{verbatim} \end{singlespacing} then enter, say, \begin{singlespacing} \begin{verbatim} plot(age,logwage,main="Manual Bandwidth Example") lines(age,fitted(npreg(logwage~age,bws=1)),col="blue",lty=1) lines(age,fitted(npreg(logwage~age,bws=2)),col="red",lty=2) lines(age,fitted(npreg(logwage~age,bws=3)),col="green",lty=3) legend(20,15, c("h=1","h=2","h=3"), col=c("blue","red","green"), lty=c(1,2,3)) \end{verbatim} \end{singlespacing} to plot the local constant estimator with bandwidths of 1, 2, and 3 years. Note that the age variable is already sorted in this dataset. If your data is not sorted you will need to do so prior to plotting so that your \texttt{lines} command works properly. Or see \ref{partial grad q} below for a multivariate example. \subsection{\label{partial grad q}When I estimate my gradients and there are two or more covariates and then extract them with the \texttt{gradients()} function, they are not 'smooth', though if I plot a model with the \texttt{gradients=TRUE} option, they are. The \texttt{gradients()} function must be broken\dots} The function \texttt{plot()} (\texttt{npplot()}) plots 'partial' means and gradients. 
In other words, it plots $x_1$ versus $\hat g(x_1,\bar x_2)$ for the partial mean, where $\bar x_2$ is, say, the median/modal value of $x_2$. It also plots $x_1$ versus $\partial\hat g(x_1,\bar x_2)/\partial x_1$ for the gradient. Note that we are controlling for the values of the other covariate(s). This is in effect what people expect when they play with linear parametric models of the form $y=\beta_0+\beta_1 x_1+\beta_2x_2+\epsilon$ since, given the additive nature of the model, $\partial y/\partial x_1=\beta_1$ (i.e., it does not vary with $x_2$). The example below shows how you could manually generate the partial gradients (and means) for your data where the sample realizations form the evaluation data for $x_1$ (unlike \texttt{npplot()}, which uses an evenly spaced grid). Note that we use the function \texttt{uocquantile()} to generate a vector that holds $x_2$ constant at its median/modal value (i.e., the 0.5 quantile) in the evaluation data. The function \texttt{uocquantile()} can compute quantiles for ordered, unordered, and continuous data (see \texttt{?uocquantile} for details).
\begin{singlespacing}
\begin{verbatim}
n <- 100
x1 <- runif(n)
x2 <- runif(n)
y <- x1^2+x2^2 + rnorm(n,sd=.1)
data.train <- data.frame(x1,x2,y)
bw <- npregbw(y~x1+x2, data=data.train, regtype="ll", bwmethod="cv.aic")
data.eval <- data.frame(x1 = sort(x1), x2 = rep(uocquantile(x2,.5),n))
model <- npreg(bws=bw, data=data.train, newdata=data.eval, gradients=TRUE)
plot(data.eval[,1],model$grad[,1],xlab="X1",ylab="Gradient",type="l")
\end{verbatim}
\end{singlespacing}
Note that this uses the sorted sample realizations for \texttt{x1}, which is perfectly acceptable. If you want to mimic \texttt{plot()} exactly, note that by default \texttt{plot()} uses a grid of equally spaced values of length \texttt{neval=50}, as per below. Both approaches are perfectly acceptable, and the point is that both control for the level of the non-axis variables.
\begin{singlespacing}
\begin{verbatim}
data.eval <- data.frame(x1 = seq(min(x1),max(x1),length=50),
                        x2 = rep(uocquantile(x2,.5),50))
\end{verbatim}
\end{singlespacing}

\subsection{I use \texttt{plot()} \texttt{(npplot())} to plot, say, a density and the resulting plot looks like an inverted density rather than a density\dots}

This can occur when the data-driven bandwidth is dramatically undersmoothed. Data-driven (i.e., automatic) bandwidth selection procedures are not guaranteed to always produce good results, due perhaps to the presence of outliers or the rounding/discretization of continuous data, among other things. By default, \texttt{npplot()} takes the two extremes of the data (the minimum and maximum, i.e., actual data points) then creates an equally spaced grid of evaluation data (i.e., not actual data points in general) and computes the density for these points. Since the bandwidth is extremely small, the density estimate at these evaluation points is correctly zero, while those for the sample realizations (in this case only two, the min and max) are non-zero, hence we get two peaks at the edges of the plot and a flat bowl equal to zero everywhere else. This can also happen when your data is heavily discretized and you treat it as continuous. In such cases, treating the data as ordered may result in more sensible estimates.
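By way of illustration, the following sketch (using simulated, heavily rounded data; the particular rounding and sample size are purely illustrative) contrasts treating such a variable as continuous with re-casting it as ordered:
\begin{verbatim}
library(np)
set.seed(42)
## Simulate a variable recorded with heavy rounding (many ties)
x <- round(rnorm(500),1)
## Treated as continuous, cross-validation may deliver a very small
## bandwidth when there are many repeated values
f.cont <- npudens(~x)
summary(f.cont$bws)
## Treated as ordered, the estimate is often more sensible
f.ord <- npudens(~ordered(x))
summary(f.ord$bws)
\end{verbatim}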
\subsection{Can \texttt{npksum()} compute analytical derivatives with respect to a continuous variable?}

As of version 0.20-0 and up, yes it can, using the \texttt{operator = "derivative"} argument, which is put through its paces in the following code snippet (the \texttt{operator} argument also accepts \texttt{"integral"} and \texttt{"convolution"} in addition to \texttt{"normal"}, the default).
\begin{singlespacing}
\begin{verbatim}
Z <- seq(-2.23,2.23,length=100)
Zc <- seq(-4.47,4.47,length=100)
par(mfrow=c(2,2))
plot(Z,main="Kernel",ylab="K()",npksum(txdat=0,exdat=Z,bws=1,
     ckertype="epanechnikov",ckerorder=2,operator="normal")$ksum,
     col="blue",type="l")
plot(Z,main="Kernel Derivative",ylab="K()",npksum(txdat=0,exdat=Z,bws=1,
     ckertype="epanechnikov",ckerorder=2,operator="derivative")$ksum,
     col="blue",type="l")
plot(Z,main="Kernel Integral",ylab="K()",npksum(txdat=0,exdat=Z,bws=1,
     ckertype="epanechnikov",ckerorder=2,operator="integral")$ksum,
     col="blue",type="l")
plot(Zc,main="Kernel Convolution",ylab="K()",npksum(txdat=0,exdat=Zc,bws=1,
     ckertype="epanechnikov",ckerorder=2,operator="convolution")$ksum,
     col="blue",type="l")
\end{verbatim}
\end{singlespacing}
An alternative to computing analytical derivatives is to compute them numerically using finite differences. One simply computes the kernel sum with variable $j$ set at $x_j-h_j/2$ (call this $ksum_{j1}$), computes it again with variable $j$ set at $x_j+h_j/2$ (call this $ksum_{j2}$), and then computes $\nabla=(ksum_{j2}-ksum_{j1})/h_j$. This method has been used for both theoretical and applied work and produces consistent estimates of the derivatives, as of course do the analytical derivatives, provided that $h\to0$ as $n\to\infty$ (which will {\em not} be the case in some settings, i.e., in the presence of irrelevant covariates and so forth). The following example provides a simple demonstration. See \ref{partial grad q} above for multivariate partial regression when using this method.
\begin{singlespacing}
\begin{verbatim}
## In this example we consider the local constant estimator computed
## using npksum, and then use npksum to compute numerical derivatives
## using finite-difference methods, then finally compare them with the
## analytical ones.
data(cps71)
attach(cps71)
## Grab the cross-validated bandwidth
bw <- npregbw(logwage~age)
h <- bw$bw[1]
## Evaluate the local constant regression at x-h/2, x+h/2...
ksum.1 <- npksum(txdat=age,exdat=age-h/2,tydat=logwage,bws=bw)$ksum/
          npksum(txdat=age,exdat=age-h/2,bws=bw)$ksum
ksum.2 <- npksum(txdat=age,exdat=age+h/2,tydat=logwage,bws=bw)$ksum/
          npksum(txdat=age,exdat=age+h/2,bws=bw)$ksum
## Compute the numerical gradient...
grad.numerical <- (ksum.2-ksum.1)/h
## Compare with the analytical gradient...
grad.analytical <- gradients(npreg(bws=bw,gradients=TRUE))
## Plot the resulting estimates...
plot(age,grad.numerical,type="l",col="blue",lty=1,ylab="gradient")
lines(age,grad.analytical,type="l",col="red",lty=2)
legend(20,-0.05,c("Numerical","Analytic"),col=c("blue","red"),lty=c(1,2))
\end{verbatim}
\end{singlespacing}

\subsection{Can I use the \texttt{npcmstest()} function that implements the consistent test for correct specification of parametric regression models as described in Hsiao, Li, \& Racine (2007) \cite{HSIAO_LI_RACINE:2007} to test for correct specification of the semiparametric partially linear model?}

As Brennan Thompson points out, yes, you can.
To test a parametric linear specification against a semiparametric partially linear alternative, i.e.,
\begin{align*}
H_0: y &= X'\beta + Z'\gamma + u\\
H_1: y &= X'\beta + g(Z) + u,
\end{align*}
you could use \texttt{npcmstest()} as follows:
\begin{singlespacing}
\begin{verbatim}
lmodel <- lm(y~X+Z,y=TRUE,x=TRUE)
uhat <- resid(lmodel)
npcmstest(xdat=Z,ydat=uhat,model=lmodel)
\end{verbatim}
\end{singlespacing}
A slightly better way (as discussed in Li \& Wang (1998) \cite{LI_WANG:1998}) would be to use a 'mixed' residual, i.e., $\hat u_i = y_i - X_i'\tilde \beta - Z_i'\hat \gamma$ in the test, where $\tilde\beta$ is the semiparametric estimator of $\beta$ (based on the semiparametric partially linear model), and $\hat \gamma$ is the OLS estimator of $\gamma$ based on the linear model. This could lead to potential power gains due to the improved efficiency of $\tilde\beta$ under the alternative.

\subsection{I am using \texttt{npcmstest()} on a \texttt{glm()} object (the document says glm() objects are supported and I am estimating a Logit model) but it returns an error saying}

\begin{verbatim}
Error in eval(expr, envir, enclos) : y values must be 0 <= y <= 1.
\end{verbatim}

\texttt{npcmstest()} supports conditional mean models with continuous outcomes (\texttt{glm()} objects are supported so that models that are nonlinear in parameters can be estimated). The test is based on residual bootstrapping to generate resamples for $Y$. In particular, a resample of the residual vector ($\hat\epsilon^*$) is added to the model's fit (i.e., $Y^*=\hat Y + \hat\epsilon^*$) to generate a resample under the null. This excludes binary outcome models and the like because you would have generated a resample for $Y$ that no longer contains zeros and ones, hence the error message. Note that \texttt{npcmstest} supports regression objects generated by \texttt{lm} and uses features specific to objects of type \texttt{lm}, so if you attempt to pass objects of a different type the function cannot be expected to work.

\subsection{I want to plot the kernel function itself. How can I do this?}

Use the \texttt{npksum()} function and switch the evaluation and training roles as in the following example that plots the 2nd, 4th, 6th, and 8th order Epanechnikov kernels.
\begin{singlespacing}
\begin{verbatim}
Z <- seq(-sqrt(5),sqrt(5),length=100)
par(mfrow=c(2,2))
plot(Z,ylab="kernel",npksum(txdat=0,exdat=Z,bws=1,ckertype="epanechnikov",
     ckerorder=2)$ksum,type="l",main="Epanechnikov [order = 2]")
plot(Z,ylab="kernel",npksum(txdat=0,exdat=Z,bws=1,ckertype="epanechnikov",
     ckerorder=4)$ksum,type="l",main="Epanechnikov [order = 4]")
plot(Z,ylab="kernel",npksum(txdat=0,exdat=Z,bws=1,ckertype="epanechnikov",
     ckerorder=6)$ksum,type="l",main="Epanechnikov [order = 6]")
plot(Z,ylab="kernel",npksum(txdat=0,exdat=Z,bws=1,ckertype="epanechnikov",
     ckerorder=8)$ksum,type="l",main="Epanechnikov [order = 8]")
\end{verbatim}
\end{singlespacing}

\subsection{In version 0.20-0 and up I can 'combine' steps such as bandwidth selection and estimation. But when I do \texttt{summary(model)} I don't get the same summary that I would get from, say, \texttt{summary(bw)} and then \texttt{summary(model)}. How do I get bandwidth object summaries when combining steps?}

Don't worry, the bandwidth object exists when you do the combined steps and is easily accessed via \texttt{summary(model\$bws)} or extracted via \texttt{bw <- model\$bws}, where \texttt{model} is the name of your model.
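By way of illustration, the following minimal snippet uses the \texttt{cps71} data included in the \texttt{np} package:
\begin{verbatim}
library(np)
data(cps71)
## Combined step: npreg() conducts bandwidth selection internally
model <- npreg(logwage~age,data=cps71)
summary(model)      ## summary of the estimated model
summary(model$bws)  ## summary of the underlying bandwidth object
bw <- model$bws     ## extract the bandwidth object for later re-use
\end{verbatim}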
\subsection{I estimated a semiparametric index model via \texttt{model <- npindex(y\textasciitilde x1+x2) } but \texttt{se(model)} returns NULL}. Perhaps you want \texttt{vcov(model)} instead (i.e., the asymptotic variance-covariance matrix)? This is supported as of version 0.40-1 provided that you set \texttt{gradients=TRUE} as the following snippet demonstrates: \begin{verbatim} set.seed(42) n <- 250 x1 <- runif(n, min=-1, max=1) x2 <- runif(n, min=-1, max=1) y <- ifelse(x1 + x2 + rnorm(n) > 0, 1, 0) ## Note that the first element of the vector beta is normalized to one ## for identification purposes hence the first row and column of the ## covariance matrix will contain zeros. model <- npindex(y~x1+x2, method="kleinspady", gradients=TRUE) vcov(model) Z <- coef(model)[-1]/sqrt(diag(vcov(model)))[-1] Z \end{verbatim} Note that, alternatively, you can get robust bootstrapped standard errors for the estimated model and gradients by adding the argument \texttt{errors=TRUE} to your call to npindex so that \texttt{se(model)} returns the vector of standard errors for the estimated conditional mean where \texttt{model} is the name of your model. Note that \texttt{model\$merr } will contain the standard errors returned by \texttt{se(model)} while \texttt{model\$gerr} will return the matrix of standard errors for the gradients {\em provided} you have set \texttt{gradients=TRUE} (furthermore, \texttt{model\$mean.grad} and \texttt{model\$mean.gerr} will give the average derivative and its bootstrapped standard errors). See the documentation of npindex for further details. The following code snippet demonstrates how one could do this for a simulated dataset. \begin{verbatim} n <- 100 x1 <- runif(n, min=-1, max=1) x2 <- runif(n, min=-1, max=1) y <- x1 - x2 + rnorm(n) bw <- npindexbw(formula=y~x1+x2) model <- npindex(bws=bw,errors=TRUE,gradients=TRUE) se.mean <- model$merr se.grad <- model$gerr \end{verbatim} \subsection{How do I interpret gradients from the conditional density estimator?} If you plot a conditional density $f(y|x)$ when $x$ is a scalar, with gradients, by default you will get the following: \begin{enumerate} \item A plot of $\partial f(y = \text{median} | x)/\partial x$ (admittedly not the most useful plot). (If $y$ is discrete the only difference is that you get a plot of $\partial f(y = \text{(unconditional) mode} | x)/\partial x$). \item A plot of $\partial f(y | x = \text{median})/\partial x$. \end{enumerate} If $x$ is multivariate (for example, 2D) you get: \begin{enumerate} \item A plot of $\partial f(y = \text{median} | x1, x2 = \text{median})/\partial x_1$ \item A plot of $\partial f(y = \text{median} | x1, x2 = \text{median})/\partial x_2$ \item A plot of $\partial f(y = \text{median} | x1 = \text{median}, x2)/\partial x_1$ \item A plot of $\partial f(y = \text{median} | x1 = \text{median}, x2)/\partial x_2$ \item A plot of $\partial f(y | x1 = \text{median}, x2 = \text{median})/\partial x_1$ \item A plot of $\partial f(y | x1 = \text{median}, x2 = \text{median})/\partial x_2$ \end{enumerate} \subsection{When I run \texttt{R} in batch mode via \texttt{R CMD BATCH filename.R} unwanted status messages (e.g., ``Multistart 1 of 10'') crowd out desired output. How can I turn off these unwanted status messages?} After loading the np library add the line \texttt{options(np.messages=FALSE)} and all such messages will be disabled. \subsection{I am getting an 'unable to allocate...' 
message after repeatedly interrupting large jobs.}

Repeated interruption of large jobs can reduce available memory under R. This occurs because memory is allocated dynamically, and memory that has been allocated is not freed when you interrupt the job (the routine will clean up after itself only if it is allowed to complete all computations; when you interrupt you never reach the point in the code where the memory is freed). If this becomes an issue, simply restart R (i.e., exit then run a fresh R session).

\subsection{I have a large number of variables, and when using the formula interface I get an error message stating 'invoked with improper formula'. I have double checked and everything looks fine. Furthermore, it works fine with the data frame interface.}

The issue is that converting formulas into character strings in \texttt{R} appears to be limited to 500 characters. We are not aware of a simple workaround, so we simply advise that you use the data frame interface when this occurs.

\subsection{How can I estimate additive semiparametric models?}

Generalized additive semiparametric models (see Hastie and Tibshirani (1990) \cite{HASTIE_TIBSHIRANI:1990}) are implemented in the \texttt{gam} package (though it does not support categorical variables). The \texttt{gam} package function \texttt{gam()} (the \texttt{mgcv} package also contains a similar function by the same name) uses iteratively reweighted least squares and either smoothing splines (\texttt{s($\cdot$)}) or local polynomial regression fitting (\texttt{'loess', lo($\cdot$)}, with a default manual 'span' of 0.75, the parameter which controls the degree of smoothing). The following code snippet demonstrates the capabilities of the \texttt{gam()} function via the \verb+wage1+ dataset included in the \texttt{np} package, using three numeric regressors.
\begin{verbatim}
library(gam)
data(wage1)
attach(wage1)
model.gam <- gam(lwage~s(educ)+s(exper)+s(tenure))
par(mfrow=c(2,2))
plot(model.gam,se=T)
detach(wage1)
\end{verbatim}
The \texttt{mgcv} package also has an implementation of generalized additive models via the identical function name, \texttt{gam()}. The biggest difference is that this uses generalized cross-validation to select the 'span' (rather than setting it manually), hence the degree of smoothing becomes part of the method (as is the case for all functions in the \texttt{np} package). The following code snippet demonstrates the \texttt{mgcv} implementation of the \texttt{gam()} function.
\begin{verbatim}
library(mgcv)
data(wage1)
attach(wage1)
model.gam <- gam(lwage~s(educ)+s(exper)+s(tenure))
par(mfrow=c(2,2))
plot(model.gam,se=T)
detach(wage1)
\end{verbatim}

\subsection{I am using \texttt{npRmpi} and am getting an error message involving \texttt{dlopen()}}

One of the most common problems experienced by users attempting to install and run MPI-aware programs is correctly identifying the location of the libraries and headers for the local MPI installation so that installation can proceed.
By way of illustration, the following environment variables need to be set for the MacPorts version of OpenMPI (www.macports.org) running on Mac OS X 10.8.3 (the environment commands listed below are for those using the 'bash' shell): \begin{verbatim} export LD_LIBRARY_PATH=/opt/local/lib export RMPI_LIB_PATH=/opt/local/lib export RMPI_TYPE=OPENMPI export RMPI_INCLUDE=/opt/local/include/openmpi launchctl setenv LD_LIBRARY_PATH $LD_LIBRARY_PATH launchctl setenv RMPI_LIB_PATH $RMPI_LIB_PATH launchctl setenv RMPI_TYPE $RMPI_TYPE launchctl setenv RMPI_INCLUDE $RMPI_INCLUDE \end{verbatim} Once set, R (and optionally RStudio) ought to function as expected. However, problems encountered during this phase are best resolved by someone with familiarity of the local installation. \subsection{I am using the \texttt{npRmpi} package but when I launch one of the demo parallel jobs I get the error \texttt{Error: could not find function ``mpi.bcast.cmd''}} When your demo program stops at the following point in your file \begin{verbatim} > mpi.bcast.cmd(np.mpi.initialize(), + caller.execute=TRUE) Error: could not find function "mpi.bcast.cmd" \end{verbatim} this likely means that either you have failed to place the \texttt{.Rprofile} file in the current or root directory as directed, or the \texttt{.Rprofile} initialization code is not being loaded as expected. \begin{enumerate} \item Make sure a copy of the initialization file \texttt{Rprofile} exists in your working or root directory and is named \texttt{.Rprofile} \item Make sure you did not run \texttt{R} with either the \texttt{--no-init-file} or \texttt{--vanilla} option (this combines a number of options including \texttt{--no-init-file} which will disable reading of \texttt{.Rprofile}) \end{enumerate} \subsection{I have estimated a partially linear model and want to extract the gradient/fitted values of the nonparametric component} People use partially linear models because they focus on the parametric component and treat the nonparametric component as a nuisance. In fact, the partially linear model is estimated by carefully getting rid of the nonparametric component $g(Z)$ prior to estimation, and then estimating a set of conditional moments nonparametrically. However, suppose after getting rid of this nonparametric nuisance component we then wished to construct a consistent estimator of $g(Z)$, the nonparametric component for a partially linear model $y = X\beta + g(Z) + u$. We might proceed as follows. \begin{verbatim} library(np) set.seed(42) n <- 250 x1 <- rnorm(n) x2 <- rbinom(n, 1, .5) z1 <- rbinom(n, 1, .5) z2 <- rnorm(n) y <- 1 + x1 + x2 + z1 + sin(z2) + rnorm(n) ## First, construct the partially linear model using local linear ## regression. model <- npplreg(y~x1+factor(x2)|factor(z1)+z2,regtype="ll") ## Next, subtract the fitted parametric component from y so that we ## have y-xbetahat. Since we can have factors we need to create the ## 'model matrix' but make sure we don't keep the intercept generated ## by model.matrix hence the [,-1]. This gives us the numeric matrix X ## which we multiply by the coefficient vector to obtain xbetahat which ## we can subtract from y. y.sub <- y-model.matrix(~x1+factor(x2))[,-1]%*%coef(model) ## Finally, regress this on the nonparametric components Z using npreg. model.sub <- npreg(y.sub~factor(z1)+z2,regtype="ll",gradients=TRUE) ## Now we can obtain derivatives etc. However, note that this is the ## model containing the transformed y with respect to the nonparametric ## components. 
We can use gradients(model.sub) etc. or plot them and ## so forth. plot(model.sub,gradients=TRUE) \end{verbatim} \subsection{The R function 'lag()' does not work as I expect it to. How can I create the $l$th lag of a numeric variable in R to be fed to functions in the \texttt{np} package?} As of version 0.60-2, time series objects are supported. However, if you prefer you can use the function \texttt{ts.intersect}, which can exploit R's lag function but return a suitable data frame, as per the following illustration: \begin{verbatim} data(lynx) loglynx <- log(lynx) lynxdata <- ts.intersect(loglynx, loglynxlag1=lag(loglynx,-1), dframe=TRUE) model <- npregbw(loglynx~loglynxlag1, data=lynxdata) plot(model) \end{verbatim} Note that in order for the above to work, the common argument fed to \texttt{ts.intersect} must be a \texttt{ts} object, so first cast it as such if it is not already (\texttt{log(lynx)} is a \texttt{ts} object since \texttt{lynx} was itself a \texttt{ts} object). Or, you can use the \texttt{embed} function to accomplish this task. Here is a simple function that might work more along the lines that you expect. By default we 'pad' the vector with NAs but you can switch this to \texttt{FALSE} if you prefer. The function will return a vector of the same length as the original vector with NA's padded for the missing values. \begin{verbatim} lag.numeric <- function(x,l=1,pad.NA=TRUE) { if(!is.numeric(x)) stop("x must be numeric") if(l < 1) stop("l (lag) must be a positive integer") if(pad.NA) x <- c(rep(NA,l),x) return(embed(x,l+1)[,l+1]) } x <- 1:10 x.lag.1 <- lag.numeric(x,1) \end{verbatim} \subsection{Can I provide sample weights to be used by functions in the \texttt{np} package?} Unfortunately, at this stage the answer is 'no', at least not directly with many of the functions as they stand. However, the function \texttt{npksum} underlies many of the functions in the \texttt{np} package and it supports passing of weights, so you may be able to use this function and do a bit of coding to fulfill your needs. Kindly see \texttt{?npksum} for illustrations. Alternatively, if reweighting of sample data is sufficient for your needs then you can feed the weighted sample realizations directly to existing functions. \subsection{Quantile regression is slow, particularly for very small/large values of tau. Can this be sped up?} Yes, indeed this can be the case for extreme values of tau (and not so extreme values as well). The reason for this is because numerical methods are used to invert the CDF. This must be done for each predictor observation requiring the solution to $n$ (or $neval$) optimization problems. An alternative is to compute the 'pseudo-inverse' via a 'lookup method'. In essence, one computes the conditional CDF for a range of $y$ values, and then computes the pseudo-inverse which is defined as \begin{align*} q_{\tau}(x) &= \inf \{ y: \, F(y|x) \geq \tau \}\\ & = \sup \{ y: \, F(y|x) \le \tau \} \end{align*} The following code demonstrates this approach for the example used in the help file for \texttt{npqreg} (see \texttt{?npqreg}). 
\begin{verbatim} data("Italy") attach(Italy) bw <- npcdensbw(gdp~year) ## Set a grid of values for which the conditional CDF will be computed ## with a range that extends well beyond the range of the data n.eval <- 1000 gdp.er <- extendrange(gdp,f=2) gdp.q <- quantile(gdp,seq(0,1,length=n.eval)) gdp.eval <- sort(c(seq(gdp.er[1],gdp.er[2],length=n.eval),gdp.q)) n.q <- length(gdp.eval) ## We only need to compute the conditional quantiles for each unique ## value of year year.unique <- unique(year) ## Consider a range of values for tau tau.seq <- c(0.01,0.05,0.25,0.5,0.75,0.95,0.99) gdp.tau <- matrix(NA,length(year.unique),length(tau.seq)) for(j in 1:length(tau.seq)) { cat("\r",j,"of",length(tau.seq)) tau <- tau.seq[j] for(i in 1:length(year.unique)) { F <- fitted(npcdist(bws=c(bw$ybw,bw$xbw), txdat = year, tydat = gdp, exdat = rep(year.unique[i],n.q), eydat = gdp.eval)) ## For each realization of the predictor, compute the ## the pseudo-inverse gdp.tau[i,j] <- ifelse(tau>=0.5, max(gdp.eval[F<=tau]), min(gdp.eval[F>=tau])) } } ## Plot the results plot(year,gdp,ylim=c(min(gdp.tau,gdp),max(gdp.tau,gdp))) for(j in 1:length(tau.seq)) { lines(year.unique,gdp.tau[,j],col=j+1,lty=j,lwd=2) } legend(min(year),max(gdp.tau,gdp), paste("tau=",tau.seq), col=1:length(tau.seq)+1, lty=1:length(tau.seq), lwd=2) \end{verbatim} \subsection{How can I generate resamples from the unknown distribution of a set of data based on my smooth kernel density estimate?} This can be accomplished by picking a sample realization, uniformly at random, then drawing from the kernel distribution centered on that training point with scale equal to the bandwidth. Below is a demonstration for the 'Old Faithful' data where we draw a random sample of size $n=1,000$ where a Gaussian kernel was used for the density estimator. \begin{verbatim} data(faithful) n <- nrow(faithful) x1 <- faithful$eruptions x2 <- faithful$waiting ## First compute the bandwidth vector bw <- npudensbw(~x1+x2,ckertype="gaussian") ## Next generate draws from the kernel density (Gaussian) n.boot <- 1000 i.boot <- sample(1:n,n.boot,replace=TRUE) x1.boot <- rnorm(n.boot,x1[i.boot],bw$bw[1]) x2.boot <- rnorm(n.boot,x2[i.boot],bw$bw[2]) ## Plot the density for the bootstrap sample using the original ## bandwidths plot(npudens(~x1.boot+x2.boot,bws=bw$bw),view="fixed",xtrim=-.2,neval=100) \end{verbatim} %% See page 294 and up in cosma_resampling_ch15.pdf in ~/R/ \subsection{Some of my variables are measured with an unusually large number of digits\dots should I rescale?} In general there should be no need to rescale data for one's statistical analysis. However, occasionally one can encounter issues with numerical accuracy (e.g.\ numerical 'overflow') regardless of the method used. It is therefore prudent to be aware of this issue. For instance, if your dependent variable represents housing prices in a particular currency and entries are recorded as e.g.\ 145,000,000,000,000,000,000 foounits, then it might be prudent to deflate this variable by $10^{20}$ (i.e.\ 1.45) so that e.g.\ sums of squares are numerically stable (say when using least squares cross-validation). Of course you can run your analysis with and without the adjustment and see whether it matters or not. But it is sometimes surprising that such things can in fact make a difference. See \texttt{?scale} for a function whose default method centers and/or scales the columns of a numeric matrix. 
\subsection{The local linear gradient estimates appear to be somewhat 'off' in that they do not correspond to the first derivative of the estimated regression function.}

It is perhaps not a widely known fact that the local linear partial derivatives obtained from the coefficients of the Taylor approximation, $\hat b(x)$, are not the analytical derivative of the estimated regression $\hat g(x)$ with respect to $x$. The local linear estimator is obtained by minimizing the following weighted least squares function (consider the one-predictor case to fix ideas)
\begin{equation*}
  \sum_i(Y_i - a - b (X_i-x))^2 K\left(\frac{X_i-x}{h}\right)
\end{equation*}
and the estimated regression function $\hat g(x)$ is given by
\begin{equation*}
  \hat g(x) =\hat a
\end{equation*}
while the gradient is given by
\begin{equation*}
  \hat \beta(x)=\hat b.
\end{equation*}
But the analytical gradient (i.e.\ the partial derivative of $\hat g(x)=\hat a$ with respect to $x$) is \textsl{not} $\hat b$ unless the bandwidth is very large (i.e.\ $h=\infty$), in which case $K((X_i-x)/h)=K(0)$ and one obtains the standard linear least squares estimates of $\hat g(x)$ and $\hat\beta(x)$ (if you compute the partial derivative algebraically you will see that the two are not the same). So, the derivative estimates arising directly from the local linear estimator will differ from the analytical derivatives, even though they are asymptotically equivalent under the standard conditions required for consistency. Thus, if economic constraints are imposed on the direct derivatives, this may produce an estimated surface that is not consistent with the constraints. This can be avoided by imposing the constraints on the analytical derivatives of the local polynomial estimator being used.

\subsection{How can I turn off all console I/O?}

To disable all console I/O, set \texttt{options(np.messages=FALSE)} and wrap the function call in \texttt{suppressWarnings()} to disable any warnings printed to the console. For instance
\begin{verbatim}
library(np)
options(np.messages=FALSE)
set.seed(42)
n <- 100
x <- sort(rnorm(n))
z <- factor(rbinom(n,1,.5))
y <- x^3 + rnorm(n)
model.np <- suppressWarnings(npreg(y~x+z))
\end{verbatim}
ought to produce no console I/O whatsoever in the call to \texttt{npreg}.

\subsection{Is there a way to exploit the sparse nature of the design matrix when all predictors are categorical and reduce computation time, particularly for cross-validation?}

Certainly. This would occur if, for instance, you had 4 predictors each of which was binary 0/1, so there are only $2^4=16$ possible unique rows in the design matrix. The following conducts kernel regression with exclusively binary predictors by way of illustration.
\begin{verbatim}
## This routine exploits a sparse design matrix when computing the
## kernel estimator of a conditional mean presuming unordered
## categorical predictors. Numerical optimization is undertaken, and
## this function can be much faster than one that ignores the sparse
## structure for large n and small k (the function npreg() currently
## does not attempt to exploit this structure).
rm(list=ls())
## Set the number of observations and number of binary predictors
n <- 1000
k <- 4
## Create data/DGP
z <- matrix(rbinom(n*k,1,.5),n,k)
y <- rowSums(z) + rnorm(n)
## Fire up uniquecombs() on a matrix of discrete support predictors.
## We call the crs variant of uniquecombs() (the function is taken
## from mgcv, but mgcv has serious load time overhead, and I don't
## want to rely on other packages unnecessarily).
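## Note (added remark, assuming the mgcv-style behaviour of
## uniquecombs()): the function returns the matrix of unique rows of
## its argument, and the attribute "index" maps each original row of z
## to the corresponding row of z.unique. This is what allows the
## cross-validation function below to work with the c unique cells
## rather than all n observations.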
z.unique <- crs:::uniquecombs(as.matrix(z)) num.z <- ncol(z.unique) ind <- attr(z.unique,"index") ind.vals <- unique(ind) c <- nrow(z.unique) if(c == n) stop("design is not sparse (no repeated rows)") ## Kernel function (li-racine unordered kernel) kernel <- function(X,x,lambda) ifelse(X==x,1,lambda) ## Least-squares cross-validation objective function cv.ls.uniquecombs <- function(lambda,y,z,z.unique,ind,ind.vals,num.z,c) { g <- numeric() for(i in 1:c) { zz <- ind == ind.vals[i] L <- apply(sapply(1:num.z, function(j){kernel(z[,j], z.unique[ind.vals[i],j], lambda[j])}),1,prod) g[zz] <- (sum(y*L)-(y*L)[zz])/np:::NZD(sum(L)-L[zz]) } return(mean((y-g)^2)) } ## Start timing for comparison with npreg() ptm <- proc.time() ## Numerical optimization via optim() - could be faster if we also ## used fn.gradient (not done here) optim.out <- optim(runif(num.z), fn=cv.ls.uniquecombs, method="L-BFGS-B", lower=rep(0,num.z), upper=rep(1,num.z), y=y, z=z, z.unique=z.unique, ind=ind, ind.vals=ind.vals, num.z=num.z, c=c) ## Extract least-squares cross-validation optimal bandwidths lambda.opt <- optim.out$par ## Compute the estimator using the optimal bandwidths g <- numeric() for(i in 1:c) { zz <- ind == ind.vals[i] L <- apply(sapply(1:num.z, function(j){kernel(z[,j], z.unique[ind.vals[i],j], lambda.opt[j])}),1,prod) g[zz] <- sum(y*L)/np:::NZD(sum(L)) } ## Save execution time for optimization and estimation t.uniquecombs <- proc.time() - ptm ## Compare timing and results with npreg() (comment out the following ## for general use) library(np) g.Formula <- "y ~" for (p in 1:k) { g.Formula <- paste(g.Formula," + factor(z[,",p,"])") } g.Formula <- formula(g.Formula) t.npreg <- system.time(g.npreg <- npreg(g.Formula, ukertype="liracine", nmulti=1)) ## Compare fitted values for first 10 observations cbind(g[1:10],fitted(g.npreg)[1:10]) ## Compare bandwidths cbind(lambda.opt,g.npreg$bws$bw) ## Compare objective function values cbind(optim.out$value,g.npreg$bws$fval) ## Compare computation time in seconds cbind(t.uniquecombs[1],t.npreg[1]) \end{verbatim} \subsection{I want to bootstrap the standard errors for the coefficients in a varying coefficient specification. Is there a simple way to accomplish this?} The code below presents one way to accomplish this. For this example the model is \begin{equation*} Y_i=\alpha(Z_i)+X_i\beta(Z_i)+\epsilon_i, \end{equation*} where $X_i$ is experience and $Z_i$ is tenure. The estimated slope vector $\beta(Z_i)$ is therefore a function of tenure. We bootstrap its standard error and then plot the resulting estimate and 95\% asymptotic confidence bounds. 
\begin{verbatim}
library(np)
data(wage1)

## Fit the model

model <- with(wage1,npscoef(tydat=lwage,
                            txdat=exper,
                            tzdat=tenure,
                            betas=TRUE))

beta2 <- coef(model)[,2]

## Bootstrap the standard error for the slope coefficient (there
## will be two coefficient vectors, the intercept and slope
## for exper, which vary with tenure)

B <- 100
coef.mat <- matrix(NA,nrow(wage1),B)

for(b in 1:B) {
  ## Bootstrap the training data, evaluate on the original
  ## sample realizations, hold the bandwidth constant at
  ## that for the original model/training data
  wage1.boot <- wage1[sample(1:nrow(wage1),replace=TRUE),]
  model.boot <- with(wage1.boot,npscoef(tydat=lwage,
                                        txdat=exper,
                                        tzdat=tenure,
                                        exdat=wage1$exper,
                                        ezdat=wage1$tenure,
                                        betas=TRUE,
                                        bws=model$bw))
  coef.mat[,b] <- coef(model.boot)[,2]
}

se.beta2 <- apply(coef.mat,1,sd)
ci.l <- beta2-1.96*se.beta2
ci.u <- beta2+1.96*se.beta2
ylim <- range(c(beta2,ci.l,ci.u))
with(wage1,plot(tenure,beta2,ylim=ylim))
with(wage1,points(tenure,ci.l,col=2))
with(wage1,points(tenure,ci.u,col=2))
legend("topleft",c("beta2","ci.l","ci.u"),col=c(1,2,2),pch=c(1,1,1),bty="n")
\end{verbatim}

\bibliographystyle{plain}
\bibliography{np_faq}

\clearpage

\appendix

\section*{Changes from Version 0.60-17 to 0.60-18 [09-12-2024]}
\begin{itemize}
\item Patched issue with \texttt{as.double(data)} when data is a NULL object reported by Brian Ripley
\item Merged rng pull request \texttt{https://github.com/JeffreyRacine/R-Package-np/pull/48}
\item Incorporated Kurt Hornik's kind suggestion ``If indeed intended as a package overview help file, please add \texttt{{\textbackslash}alias\{np-package\}} to file \texttt{'np/man/np-package.Rd'}''
\end{itemize}

\section*{Changes from Version 0.60-16 to 0.60-17 [12-Mar-2023]}
\begin{itemize}
\item Fixed warning
\begin{verbatim}
Check: S3 generic/method consistency
Result: WARN
  compute.bootstrap.errors:
    function(..., bws)
\end{verbatim}
triggered potentially by lack of \texttt{...,bw} in functions called \texttt{compute.bootstrap.errors.*} (thanks to Kurt Hornik for assistance!)
\item Updated \texttt{inst/CITATIONS} to get rid of old-style cit and bib entries
\end{itemize}

\section*{Changes from Version 0.60-15 to 0.60-16 [18-Oct-2022]}
\begin{itemize}
\item Fixed \texttt{warning: a function definition without a prototype is deprecated in all versions of C and is not supported in C2x [-Wdeprecated-non-prototype]} (thanks Zhenghua!)
\end{itemize}

\section*{Changes from Version 0.60-14 to 0.60-15 [09-Oct-2022]}
\begin{itemize}
\item Explicit acknowledgment in \texttt{DESCRIPTION} of code \texttt{gsl\_bspline.c/.h} adapted from the GNU GSL library with further details appended to \texttt{gsl\_bspline.c/.h} (thanks to Professor Brian Ripley for noting this oversight).
\end{itemize}

\section*{Changes from Version 0.60-13 to 0.60-14 [22-Aug-2022]}
\begin{itemize}
\item Added logspline and ks to DESCRIPTION to address NOTE on some system builds stating \texttt{Undeclared packages logspline, ks in Rd xrefs}
\item Zhenghua Nie kindly addressed the clang-UBSAN issue in jksum.c causing \texttt{runtime error: applying non-zero offset 8 to null pointer}
\end{itemize}

\section*{Changes from Version 0.60-12 to 0.60-13 [15-Aug-2022]}
\begin{itemize}
\item Minor changes to np.Rnw vignette viz Sweave options and class (redundancies commented out)
\item Fixed issues relating to new NOTE ``Found if() conditions comparing class() to string [..]
Use inherits() (or maybe is()) instead.'' Addressed by using \texttt{isa()}
\end{itemize}

\section*{Changes from Version 0.60-11 to 0.60-12 [12-Aug-2022]}
\begin{itemize}
\item JSS issues corrected in vignette/*.bib and inst/CITATION (use of DOI, etc., thanks to Achim Zeileis)
\item Issues with utf8x in entropy\_np.Rnw and np.Rnw corrected (utf8x changed to utf8, etc.)
\end{itemize}

\section*{Changes from Version 0.60-10 to 0.60-11 [04-Jun-2021]}
\begin{itemize}
\item \texttt{npuniden.boundary()} uses empirical support for \texttt{a} and \texttt{b} (previously was \texttt{[0,1]}, now \texttt{[min(X),max(X)]})
\item Corrected potential issue with \texttt{malloc} failing on null vector in C function \texttt{sort\_unique()} detected by development version of \texttt{gcc11} (thanks to Professor Brian Ripley for pointing this out and for his assistance)
\item Corrected outdated URLs
\end{itemize}

\section*{Changes from Version 0.60-9 to 0.60-10 [5-Feb-2020]}
\begin{itemize}
\item Corrected issue \texttt{Documented arguments not in usage in documentation object se: \dots}
\item Removal of directed quotes in some Rd documents, replacement with \verb+\sQuote{}+
\item Fixed issue causing rebuilding of vignettes to crash (\texttt{uocquantile()} was the culprit)
\end{itemize}

\section*{Changes from Version 0.60-8 to 0.60-9 [24-Oct-2018]}
\begin{itemize}
\item Added support for character expansion to \texttt{npplot()} (which is invoked by calling \texttt{plot()} on an np object) - will take defaults from the environment \texttt{par()} or can be set directly, e.g., \texttt{plot(foo,cex.main=0.8)}
\end{itemize}

\section*{Changes from Version 0.60-7 to 0.60-8 [04-Jun-2018]}
\begin{itemize}
\item Used patch suggested by Scott Thompson for issue \url{https://github.com/JeffreyRacine/R-Package-np/issues/8} regarding \texttt{npcdens()/npcdist()} and formulas
\end{itemize}

\section*{Changes from Version 0.60-6 to 0.60-7 [01-May-2018]}
\begin{itemize}
\item Fix for \texttt{switch()} and zero length character strings that affected summary output for certain tests
\item Added function for shape constrained bounded univariate PDF and CDF estimation (\texttt{npuniden.sc()})
\item Added option for renormalizing improper bounded density estimates that can arise from the use of asymmetric negative edge kernel functions (\texttt{proper=TRUE}, \texttt{npuniden.boundary()})
\item Fixed issues with CDF estimation in the presence of large bandwidths for bounded density estimates (\texttt{npuniden.boundary()})
\end{itemize}

\section*{Changes from Version 0.60-5 to 0.60-6 [12-Jan-2018]}
\begin{itemize}
\item Added more edge kernel functions for \texttt{npuniden.boundary()}, some glitches addressed
\item Added asymptotic standard errors and integrated density (CDF) to \texttt{npuniden.boundary()} and \texttt{npuniden.reflect()} plus additional examples
\item Added least squares cross-validation to \texttt{npuniden.boundary()} (default for boundary kernel functions)
\end{itemize}

\section*{Changes from Version 0.60-4 to 0.60-5 [04-Jan-2018]}
\begin{itemize}
\item Added two basic functions for univariate bounded kernel density estimation, \texttt{npuniden.boundary} and \texttt{npuniden.reflect}
\item More plotting routines accept more passable parameters
\end{itemize}

\section*{Changes from Version 0.60-3 to 0.60-4 [03-Dec-2017]}
\begin{itemize}
\item Default \texttt{plot.errors.style} changed to \texttt{"band"}
\item Using \texttt{wrapLines(strwrap(...))} for summary output for model specification test to automatically wrap long
formula lines \item \texttt{plot()} calls \texttt{npplot()} and \texttt{npplot()} behaves better when most passable parameters are fed (\texttt{col =}, \texttt{lty =}, \texttt{xlab =}, etc.) but there may still be room for improvement \item Added option \texttt{plot.par.mfrow=TRUE/FALSE} to \texttt{npplot()} so that the automated setting of \texttt{par(mfrow=c(,))} in plot is disabled and the user can do this manually if desired \item Option \texttt{plot.par.mfrow=TRUE/FALSE} can be overridden by setting \texttt{options(plot.par.mfrow=TRUE/FALSE)} \end{itemize} \section*{Changes from Version 0.60-2 to 0.60-3 [29-Apr-2017]} \begin{itemize} \item Fixed the 'cannot find function \texttt{is()}' glitch that just surfaced with R 3.4.0 \item Fixed glitch in \texttt{npconmode()} when randomizing if all probabilities are equal \item Migrated some code from crs (which I wrote) to avoid \texttt{crs:::} \item Fixed outstanding glitch in \texttt{npscoef()} with \texttt{tydat} \item Option \texttt{leave.one.out} not properly implemented in \texttt{npscoef()}, corrected \item Using pmin/pmax in NZD (parallel min/max more efficient with vectors) \item Improved screen i/o when \texttt{smooth.residuals=TRUE/FALSE} \item Corrected glitch in \texttt{npregiv()} where multivariate z would halt with error \item \texttt{npregiv()} with \texttt{method="Landweber-Fridman"} no longer passes back lists \item Fixed glitch in \texttt{Kmat.lp} with p=0 and derivatives \item Tikhonov regularization in \texttt{npregiv()} now supports evaluation data \item Updated \texttt{npregiv()} (was sorely lacking, did not support eval etc.) \item Use of \texttt{.onLoad} suggested (by Bernd Bischl) \item Added \texttt{vignette("entropy\_np",package="np")} to startup message \item In \texttt{np.distribution.bw.R} default for \texttt{do.full.integral} changed to FALSE \item Fixed the CDF objective function when averaging over the training data \item Fixed crash error with \texttt{npcdistbw()} and adaptive bandwidths \end{itemize} \section*{Changes from Version 0.60-1 to 0.60-2 [27-Jun-2014]} \begin{itemize} \item added timeseries support for all relevant objects (i.e.\ for a \texttt{ts()} vector data object \texttt{x}, \verb^npreg(x~lag(x,-1)+lag(x,-2))^ is now supported) \item added total time to summary for bandwidth objects \item \texttt{npqreg()} no longer accepts \texttt{gradients=TRUE} when gradients are in fact not supported \item \texttt{npqreg()} fails with an informative message when passed a conditional density bandwidth object \end{itemize} \section*{Changes from Version 0.60-0 to 0.60-1 [6-Jun-2014]} \begin{itemize} \item Fixed glitch in adaptive\_nn/generalized\_nn bandwidths that affected all routines that rely on non-fixed bandwidths \item Tweaks to search for adaptive\_nn/generalized\_nn initial search values \item Fixed glitch in local linear estimation with adaptive\_nn bandwidths \end{itemize} \section*{Changes from Version 0.50-1 to 0.60-0 [1-Jun-2014]} \begin{itemize} \item Ordered kernel types now default to liracine/liracine (normalized) for conditional/unconditional objects, respectively (the previous default, i.e.\ the Wang van Ryzin kernel, is poorly behaved when smoothing out of ordered predictors is appropriate) \item Added analytical ordered CDF kernels, resulting in significant speedups for cross validation with ordered variables \item Added analytical ordered convolution kernels, resulting in significant speedups for least-squares cross validation with ordered variables \item The entire C backend has been 
rewritten and improved in almost every regard
\item The Rmpi backend has been updated to Rmpi version 0.6-5
\item Glitch in adaptive convolution kernels corrected
\item Added truncated Gaussian kernel (see \texttt{nptgauss()} for modifying the truncation radius)
\item Support for trees complete (use \texttt{options(np.tree=TRUE)}) which when used in conjunction with bounded kernels (i.e.\ \texttt{"epanechnikov"}/\texttt{"truncated gaussian"}) can reduce the computational burden for certain problems
\item Optimizers that make use of Powell's direction set method now accept additional arguments that can be used to potentially improve default settings
\item Default search settings for optimizers that make use of Powell's direction set method should better scale to the range of variables
\item Added mean absolute deviation/1.4826 to mix of robust scale elements
\item Corrected error in order of conditional density/distribution manual bandwidths pointed out by Decet Romain
\item Figure in vignette not displaying properly, needed \texttt{png=TRUE} (reported by Christophe Bontemps)
\item Using \texttt{chol2inv}/\texttt{chol} rather than \texttt{solve} throughout R routines that rely on inversion
\item Fixed glitch in \texttt{npindexbw()} to stop maxit from blowing up every time convergence fails
\item Fixed issue with summary reporting incorrect value of objective function in certain bandwidth objects
\item When \texttt{nmulti} $>$ 1, the full multi-starting search history is now returned in a vector named \texttt{fval.history}
\item Added \texttt{na.action} for consistency with other R functions such as \texttt{lm()}
\item New function \texttt{npquantile()} that returns smooth univariate quantiles
\item \texttt{npksum()} explicitly only uses raw bandwidths now (and will emit an error if passed numeric scale factors, bandwidth objects are still OK)
\item Fixed regression in \texttt{npindex()} with bootstrapped standard errors
\item Code makes use of one call to \texttt{npksum()} in \texttt{npindex()} and \texttt{npscoef()} where possible rather than two separate calls
\item Updated \texttt{npsigtest()} for addition of power and added joint test to the mix
\item Changed \texttt{ceiling()} to \texttt{max(1,round())} in \texttt{b.star()} per Dimitris Politis's suggestion
\item Reworked the interface for \texttt{npcopula()} to avoid two bandwidths and \texttt{density=TRUE} by instead exploiting the passing of either \texttt{npudistbw()} (copula) or \texttt{npudensbw()} (copula density)
\end{itemize}

\section*{Changes from Version 0.40-13 to 0.50-1 [13-Mar-2013]}
\begin{itemize}
\item The functions \texttt{npudist()} and \texttt{npudistbw()} are now uncoupled from \texttt{npudens()} and \texttt{npudensbw()} (previously they relied on unconditional PDF bandwidths due to the lack of a multivariate mixed-data method for selecting bandwidths for CDFs - now with Li \& Racine (2013) we have a robust method hence the split)
\item The use of \texttt{cdf=TRUE} is deprecated for \texttt{npudens()} and \texttt{npudensbw()} objects due to the uncoupling described above
\item Fixed glitch in \texttt{gradient} standard errors in \texttt{npindex()} where identical standard errors were output in \texttt{model\$gerr}
\item Fixed glitch in covariance matrix in \texttt{npindex()} (\texttt{"ichimura"}) where covariance matrix was not symmetric
\item Fixed glitch in npksum where use of \texttt{bwtype="adaptive\_nn"} and \texttt{operator="integral"} produced the survivor function rather than the cumulative distribution function
\item Cleaned up internals
for \texttt{npcmstest()} \item Using .onUnload rather than .Last.lib in zzz.R \item Fixed glitch in \texttt{npreg()} summary where 'Residual standard error' was reporting residual variance \item \texttt{npksum()} functionality extended \begin{itemize} \item \texttt{npksum()} can now return the matrix of kernel weights (which can be useful for constrained estimation, by way of illustration, or constructing novel kernel-based procedures without the need to write extensive code) \item \texttt{npksum()} can now accept different operators on a product kernel, for example, \begin{verbatim} npksum(txdat=data.frame(x1,x2),operator=c("derivative","normal"),bws=c(1,1)) \end{verbatim} will use the derivative kernel for x1 (i.e. the derivative of the gaussian kernel) and the default kernel for x2 (i.e. a standard kernel such as the gaussian) thereby allowing the user to program a number of estimators such as conditional CDFs etc. that were previously not available via \texttt{npksum()} \end{itemize} \item Fixed glitch with variable scope where certain objects could not be found in the environment \item Added function \texttt{npcopula()} for d-dimensional copula estimation via inversion \item Modified stopping rules in \texttt{npregiv()} and \texttt{npregivderiv()} \item Added reference to $R^2$ measure (Doksum and Samarov (1995)) \item Startup message points to the faq, faq is now a vignette \end{itemize} \section*{Changes from Version 0.40-12 to 0.40-13 [05-Mar-2012]} \begin{itemize} \item Added new function \texttt{npregivderiv()} that implements the IV derivative method of Florens and Racine (2012) \item Added more passable parameters to \texttt{npregiv()} (multistarting, parameters passed to \texttt{optim()} for cross-validation) \item Changes to code to improve compliance with R 'Writing portable packages' guidelines and correct partial argument matches \end{itemize} \section*{Changes from Version 0.40-11 to 0.40-12 [24-Nov-2011]} \begin{itemize} \item Added option (user request) to hold the bandwidth fixed but optimize the parameters in the single index model \end{itemize} \section*{Changes from Version 0.40-10 to 0.40-11 [24-Oct-2011]} \begin{itemize} \item Corrected code regression in single index errors introduced inadvertently in 0.40-10 \end{itemize} \section*{Changes from Version 0.40-9 to 0.40-10 [24-Oct-2011]} \begin{itemize} \item Modified Silverman's adaptive measure of spread to reflect changes in \texttt{sd()} (sd on matrix deprecated) \end{itemize} \section*{Changes from Version 0.40-8 to 0.40-9 [30-July-2011]} \begin{itemize} \item Renamed COPYING file to COPYRIGHTS \end{itemize} \section*{Changes from Version 0.40-7 to 0.40-8 [29-July-2011]} \begin{itemize} \item Fixed issue where calling \texttt{npplot} resets system seed \item Updated examples in docs so that plot is recommended throughout (and not \texttt{npplot} that is invoked by \texttt{plot}) \item Fixed regression in \texttt{npindex} when \texttt{gradients=TRUE} and \texttt{errors=TRUE} \item Function \texttt{npindex/npindexbw} now accepts additional arguments and implements this properly (i.e. proper implementation by Tristen of Version 0.30-8/0.30-9 change for \texttt{npindex}) \item Function \texttt{npplreg} now supports factors in the parametric part just like \texttt{lm()} does \end{itemize} \section*{Changes from Version 0.40-6 to 0.40-7 [8-Jun-2011]} \begin{itemize} \item Function \texttt{npregiv} now supports exogenous $X$ and multivariate $Z$ and $W$. \item \texttt{demo(npregiv)} provides a useful illustration. 
\end{itemize} \section*{Changes from Version 0.40-5 to 0.40-6 [1-Jun-2011]} \begin{itemize} \item Added a new function \texttt{npregiv} that conducts nonparametric instrumental regression a la Darolles, Fan, Florens and Renault (2011, Econometrica) and Horowitz (2011, Econometrica). Note that this function currently returns the fitted $\varphi(z)$ (i.e.\ lacks much functionality relative to other \texttt{np} functions) and is in 'beta status' until further notice. \item Added a new dataset \texttt{Engel95} that allows one to estimate Engel curves using the new nonparametric instrumental regression function \texttt{npregiv}. \end{itemize} \section*{Changes from Version 0.40-4 to 0.40-5 [26-Apr-2011]} \begin{itemize} \item Fixed issue with \texttt{npindexbw} where, for certain problems, starting values needed refinement otherwise convergence would fail (we now use an improved normalization for the starting values) \end{itemize} \section*{Changes from Version 0.40-3 to 0.40-4 [21-Jan-2011]} \begin{itemize} \item Fixed issue with \texttt{ckertype} and \texttt{ckerorder} not being propagated in \texttt{np.singleindex.bw.R} \item Fixed issue with negative penalties being returned by \texttt{bwmethod="cv.aic"} in \texttt{npregbw} (ought to have been +infinity) \item Error being thrown by \texttt{system(..., intern=TRUE)} when \texttt{mpi.quit()} is called, changed to FALSE (change to \texttt{system()} behaviour detailed in R CHANGELOG 2.12.0) \end{itemize} \section*{Changes from Version 0.40-1 to 0.40-3 [23-Jul-2010]} \begin{itemize} \item Added random seed (defaults to 42) to \texttt{npscoefbw} to ensure consistent values for optimization for successive invocations on the same data \item Fixed glitch in multistarting in \texttt{npscoefbw} whereby multistarting was not working (always returned last computed function value and not the minimum) \item Fixed issue for \texttt{npRmpi} where the \texttt{C} code underlying regression cross-validation (code in \texttt{jksum.c}) differs between np and npRmpi (both are correct with the latter being a tad slower) \item Fixed a scope issue whereby a user would write a function that calls an \texttt{np/npRmpi} command, however, objects passed to the user's function and called by the \texttt{np/npRmpi} command (i.e. 
such as \texttt{newdata}) cannot be found in the environment yet they exist outside of the function \item Fixed issue with \texttt{bwscaling=TRUE} and \texttt{bwmethod="cv.aic"} in npreg \end{itemize} \section*{Changes from Version 0.40-0 to 0.40-1 [4-Jun-2010]} \begin{itemize} \item Added asymptotic standard errors to \texttt{npindex} for the Klein and Spady and Ichimura parameter estimates which, when \texttt{gradients=TRUE}, can be extracted via \texttt{vcov(foo)} where \texttt{foo} is a npsingleindex object (the Z-scores can be obtained via \texttt{Z <- coef(foo)[-1]/sqrt(diag(vcov(foo)))[-1]}) \end{itemize} \section*{Changes from Version 0.30-9 to 0.40-0 [25-May-2010]} \begin{itemize} \item Modified codebase to enable dynamic spawning for interactive sessions in \texttt{npRmpi} \item Interactive examples supported in \texttt{npRmpi} \end{itemize} \section*{Changes from Version 0.30-8 to 0.30-9 [17-May-2010]} \begin{itemize} \item Fixed issue where \texttt{ukertype} and \texttt{okertype} were being ignored by \texttt{npscoef} \item Fixed code regression (dating to version 0.30-4) where \texttt{random.seed=42} was not initialized in functions \texttt{npcmstest}, \texttt{npdeneqtest}, \texttt{npindexbw}, \texttt{npsdeptest}, \texttt{npqcmstest}, \texttt{npsigtest}, \texttt{npsymtest}, \texttt{npunitest}, and \texttt{npplot} \item Fixed issue with saving and restoring random seed in \texttt{npdeptest} \item Changes to codebase to modify method used to prevent division by zero \item New vignette for the \texttt{npRmpi} package \texttt{(vignette("npRmpi",package="npRmpi"))} \end{itemize} \section*{Changes from Version 0.30-7 to 0.30-8 [20-Apr-2010]} \begin{itemize} \item Implemented moment version of metric entropy in \texttt{npsymtest} and \texttt{npunitest} with warnings about their use documented carefully and exceptions trapped and warnings issuing when detected \item Cleaned up print/summary output formatting of some functions \end{itemize} \section*{Changes from Version 0.30-6 to 0.30-7 [15-Feb-2010]} \begin{itemize} \item Added function \texttt{npunitest} for entropy-based testing of equality of univariate densities as described in Maasoumi and Racine (2002) \item Updated vignette to reflect new functions from 0.30-4 upwards (Table 1: np functions) \end{itemize} \section*{Changes from Version 0.30-5 to 0.30-6 [3-Feb-2010]} \begin{itemize} \item Added function \texttt{npsdeptest} for entropy-based testing of nonlinear serial dependence described in Granger, Maasoumi and Racine (2004) \item Added function \texttt{npdeptest} for entropy-based testing of nonlinear pairwise dependence described in Maasoumi and Racine (2002) \item Added more bootstrap options to \texttt{npsymtest} (now both iid and time-series bootstrapping are supported) \item Cleaned up summary formatting in the vignette by adding \verb+\usepackage[utf8x]{inputenc}+ to the Sweave file np.Rnw \item Fixed issue with saving and restoring random seed when there was none in the environment \end{itemize} \section*{Changes from Version 0.30-4 to 0.30-5 [29-Jan-2010]} \begin{itemize} \item Added function \texttt{npdeneqtest} for integrated squared difference testing of equality of densities as described in Maasoumi, Li, and Racine (2009), Journal of Econometrics \item Save random seed prior to setting seed in certain functions, then restore seed after function completes \end{itemize} \section*{Changes from Version 0.30-3 to 0.30-4 [27-Jan-2010]} \begin{itemize} \item Added function \texttt{npsymtest} for entropy-based testing of 
symmetry described in Maasoumi and Racine (2009), Econometric Reviews
\item Added function \texttt{b.star} that automates block length selection for the stationary and circular bootstrap
\item Cleaned up docs
\end{itemize}

\section*{Changes from Version 0.30-2 to 0.30-3 [28-May-2009]}
\begin{itemize}
\item Corrected error in Epanechnikov convolution kernels for fixed and generalized bandwidth objects
\item Changed default example in \texttt{npscoef}
\end{itemize}

\section*{Changes from Version 0.30-1 to 0.30-2 [19-Apr-2009]}
\begin{itemize}
\item \texttt{min(std,IQR/1.348)} is the adaptive measure of spread. We now test for the pathological case where \texttt{IQR=0} but \texttt{std>0} and return \texttt{std} in this instance
\end{itemize}

\section*{Changes from Version 0.30-0 to 0.30-1 [29-Jan-2009]}
\begin{itemize}
\item \texttt{predict()} now supports bandwidth, density, distribution, conbandwidth, condensity, and condistribution objects
\item Consistently allow predictions for categorical values outside of the support of the training data. Note that predictions based upon unconditional density objects defined over categorical variables that lie outside the support of the training data may no longer be true probabilities (i.e., as defined over the training data and the extended/augmented support -- their sum may exceed one) and may therefore require renormalization by the user
\item Fixed a numerical issue which could hinder \texttt{npregbw()}'s cross validation with higher-order kernels
\item Default \texttt{nmulti} in \texttt{npplregbw()} is now set correctly
\item Fixed a bug with the ridging routine in \texttt{npscoefbw()}, added ridging to \texttt{npscoef()}
\item Fixed minor i/o issue with \texttt{Multistart 1 of...} using \texttt{npscoefbw()}
\end{itemize}

\section*{Changes from Version 0.20-4 to 0.30-0 [15-Jan-2009]}
\begin{itemize}
\item Added basic user-interrupt checking for all underlying C code so that either \verb++ (Rterm) or the 'STOP' icon (Rgui) will interrupt all running processes. This has a number of desirable side effects in addition to being able to interrupt C-based processes including i) R no longer showing up as 'not responding' under the task manager (Windows) or the activity monitor (Mac OS X) and ii) buffered output now being correctly displayed when using Rgui under Windows and Mac OS X. Note that repeated interruption of large jobs can reduce available memory under R - if this becomes an issue (i.e., you get a 'cannot allocate...' error under R) simply restart R (i.e., exit then run a fresh R session)
\item Added a function \texttt{npseed()} that allows the user to set/reset the random seed for all underlying C routines
\item Fixed a bug that caused \texttt{npplregbw()} to ignore any kernel options for the regression of $y$ on $z$
\item Refined certain constants used in the normal-reference density bandwidth rule for increased accuracy
\item Moved from using the maximum likelihood estimate of variance throughout to the degrees of freedom corrected estimate (all variance estimates now change by the factor (n-1)/n)
\end{itemize}

\section*{Changes from Version 0.20-3 to 0.20-4 [19-Nov-2008]}
\begin{itemize}
\item Using an adaptive measure of spread throughout. The scale factor reported for a bandwidth can appear to be small when the sample standard deviation of the associated variable is inflated due to the presence of outliers.
Furthermore, supplying a scale factor of, say, 1.06 for density estimation when there are outliers that inflate the standard deviation may oversmooth rather dramatically. We now use the measure found in Silverman (1986, equation (3.30)), which is min(standard deviation, interquartile range/1.349). This robust choice produces expected results for scale factors in the presence of outliers
\end{itemize}

\section*{Changes from Version 0.20-2 to 0.20-3 [14-Nov-2008]}
\begin{itemize}
\item Fixed a typo which caused \texttt{predict()} and \texttt{plot()} to abort when called on \texttt{plregression} objects, and which also prevented \texttt{print()} and \texttt{summary()} from printing information about the kernels used when called on \texttt{plregression} objects
\item Fixed a typo which caused partially linear regressions to crash when out-of-sample responses were provided with evaluation data
\end{itemize}

\section*{Changes from Version 0.20-1 to 0.20-2 [02-Nov-2008]}
\begin{itemize}
\item Allow for evaluation outside of discrete support of factors in \texttt{npksum()} and fixed a warning in jksum
\item Fixed a bug which led to unpredictable behavior when there were more categorical values for the training data than realisations
\end{itemize}

\section*{Changes from Version 0.20-0 to 0.20-1 [13-Aug-2008]}
\begin{itemize}
\item Work-around for scale-factor issues during \texttt{npregbw()} cv when changing the training data
\end{itemize}

\section*{Changes from Version 0.14-3 to 0.20-0 [28-Jul-2008]}
\begin{itemize}
\item \texttt{npksum()} now supports an expanded set of kernels (including convolution, derivative and integral), which can be selected via the \texttt{operator =} argument
\item Automatic bandwidth searches are now performed when attempting to evaluate on data without bandwidths. This allows users to combine bandwidth selection and estimation in one step
\item The \texttt{npsigtest()} interface is brought in line with other functions (S3)
\item Significance tests can now be performed on \texttt{npreg()} outputs, so \texttt{npsigtest(modelname)} is now supported
\item Added a vignette and faq. To see the vignette try \texttt{vignette("np",package="np")}
\item \texttt{summary()} on \texttt{npconmode()} now properly retrieves names from bandwidth objects
\item Fixed the 6th and 8th order Epanechnikov kernels
\item Fixed some quietness issues
\item \texttt{npplot()} now returns data upon request for conditional densities
\item \texttt{npreg()} and \texttt{npcdens()} now take the appropriate limits in some pathological cases
\item User supplied bandwidths now operate seamlessly with the formula interface
\end{itemize}

\section*{Changes from Version 0.14-2 to 0.14-3 [02-May-2008]}
\begin{itemize}
\item Fixed a glitch that only arose when using the \texttt{liracine} unordered kernel in the presence of irrelevant variables. The upper bound for numerical search was constrained to be (c-1)/c [that for the \texttt{aitchisonaitken} unordered kernel] but ought to have been 1. The summary output would therefore show a value of lambda hitting the (smaller) upper bound (c-1)/c when it may have hit the (larger) upper bound 1
\end{itemize}

\section*{Changes from Version 0.14-1 to 0.14-2 [11-Jan-2008]}
\begin{itemize}
\item Relaxed checking tolerances slightly to prevent spurious 'invalid bandwidth' errors
\item Empty sections were removed from help files
\item \texttt{example(foobar)} now works again.
This was disabled in 0.14-1 at the request of the \texttt{R} maintainers in order to shorten the duration of \texttt{R CMD check}. All examples remained in the help files but due to the presence of 'dontrun' they were not run when \texttt{example(foobar)} was requested. Now a limited subset is run while the full set of examples remains in the documents
\end{itemize}

\section*{Changes from Version 0.13-1 to 0.14-1 [18-Dec-2007]}
\begin{itemize}
\item Now use \texttt{optim()} for minimisation in single index and smooth coefficient models
\item Fixed bug in Klein-Spady objective function
\item Standard errors are now available in the case of no continuous variables
\item Summary should look prettier, print additional information
\item Tidied up lingering issues with out-of-sample data and conditional modes
\item Fixed error when plotting asymptotic errors with conditional densities
\item Fixed a bug in \texttt{npplot()} with partially linear regressions and \texttt{plot.behavior="data"} or \texttt{"plot-data"}
\item Maximum default number of multistarts is 5
\item Least-squares cross-validation of conditional densities uses a new, much faster algorithm
\item New, faster algorithm for least-squares cross-validation for both local-constant and local linear regressions. Note that the estimator has changed somewhat: both cross-validation and the estimator itself use a method of shrinking towards the local constant estimator when singularity would otherwise lead to the breakdown of the estimator. This arises in sparse data settings in conjunction with small bandwidths for one or more regressors
\item Optimised smooth coefficient code, added ridging
\item Fixed bug in uniform CDF kernel
\item Fixed bug where \texttt{npindexbw()} would ignore \texttt{bandwidth.compute = FALSE} and compute bandwidths when supplied with a preexisting bw object
\item Now can handle estimation out of discrete support
\item Summary would misreport the values of discrete scale factors which were computed with \texttt{bwscaling = TRUE}
\end{itemize}

\section*{Changes from Version 0.12-1 to 0.13-1 [03-May-2007]}
\begin{itemize}
\item Bandwidths are now checked for validity based on their variable and kernel types
\item np now does a better job of preserving names of some 'y' data
\item Names of coefficients returned from \texttt{coef()} now match variable names
\item Fixed some corner cases in \texttt{npksum()} involving the dimensionality of outputs
\item Fixed deprecation warnings in \texttt{R} 2.5.0 caused by use of \$ on atomic objects
\item Various and sundry bug fixes in \texttt{npscoef()}
\item \texttt{npscoef()} now handles discrete 'z' data
\item Predict now accepts the argument 'se.fit', like predict.lm
\item Fixed bug where incorrect asymptotic standard errors of gradients for regression objects were being displayed in \texttt{npplot()}
\item Fixed bug where errors of gradients of regression objects were not being returned in matrix form
\item \texttt{vcov()} now works with partially linear regression objects
\item Fixed detection of evaluation responses when using the formula interface
\item Pre-computed bandwidth objects are now provided for some of the more computationally burdensome examples
\item Added Jeffrey Wooldridge's WAGE1 dataset with qualitative variables (married, female, nonwhite)
\item Predictions outside of discrete support for regressions and conditional densities are now allowed
\item Fixed sign issue with scaling of standard errors in the single index model
\item Fixed error when calculating some
bandwidths/scale factors for display purposes \item Bug in passing certain arguments to \texttt{npcdensbw()} fixed \item Added predict method for qregression objects \item Proper normalisation for liracine kernel shown in summary \item Fixed output bug ($\hat ~$H) in summary method for sigtest objects \item Fixed regression with plotting of bootstrapped errors in perspective plots \item \texttt{npcdist()} no longer incorrectly calls \texttt{npcdens()} \item Fixed spacing between var name and p-value in significance test summaries \end{itemize} \section*{Version 0.12-1 [19-Nov-2006]} \begin{itemize} \item Initial release of the np package on CRAN \end{itemize} \end{document}