[R] quantreg speed

Sun Nov 16 02:40:48 CET 2014

Hi William,

Thank you very much for your reply.

I subsampled the data down to ~1.8 million records. The fit seems to work
fine except at the 99th percentile, where the p-values for all the
features are 1.0. Does this mean I'm subsampling too much? How should I
interpret the result?

tau: [1] 0.25

Coefficients:
            Value      Std. Error t value    Pr(>|t|)
(Intercept)   72.15700    0.03651 1976.10513    0.00000
f1            -0.51000    0.04906  -10.39508    0.00000
f2           -20.44200    0.03933 -519.78766    0.00000
f3            -2.37000    0.04871  -48.65117    0.00000
f1:f2         -2.52500    0.05315  -47.50361    0.00000
f1:f3          1.03600    0.06573   15.76193    0.00000
f2:f3          3.41300    0.05247   65.05075    0.00000
f1:f2:f3      -0.83800    0.07120  -11.77002    0.00000

Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 *
    f3 + f2 * f3 + f1 * f2 * f3, tau = c(0.25, 0.5,
    0.75, 0.9, 0.95, 0.99), data = data_stats)

tau: [1] 0.5

Coefficients:
            Value      Std. Error t value    Pr(>|t|)
(Intercept)   83.80900    0.05626 1489.61222    0.00000
f1            -0.92200    0.07528  -12.24692    0.00000
f2           -27.90700    0.05937 -470.07189    0.00000
f3            -6.45000    0.07204  -89.53909    0.00000
f1:f2         -2.66500    0.07933  -33.59275    0.00000
f1:f3          1.99000    0.09869   20.16440    0.00000
f2:f3          7.09600    0.07611   93.23813    0.00000
f1:f2:f3      -1.71200    0.10390  -16.47660    0.00000

Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 *
    f3 + f2 * f3 + f1 * f2 * f3, tau = c(0.25, 0.5,
    0.75, 0.9, 0.95, 0.99), data = data_stats)

tau: [1] 0.75

Coefficients:
            Value      Std. Error t value    Pr(>|t|)
(Intercept)  102.71700    0.10175 1009.45946    0.00000
f1            -1.59300    0.13241  -12.03125    0.00000
f2           -40.64200    0.10623 -382.58456    0.00000
f3           -14.40900    0.12096 -119.11988    0.00000
f1:f2         -2.97600    0.13867  -21.46071    0.00000
f1:f3          3.74600    0.16335   22.93165    0.00000
f2:f3         14.14800    0.12692  111.47217    0.00000
f1:f2:f3      -3.16400    0.17159  -18.43899    0.00000

Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 *
    f3 + f2 * f3 + f1 * f2 * f3, tau = c(0.25, 0.5,
    0.75, 0.9, 0.95, 0.99), data = data_stats)

tau: [1] 0.9

Coefficients:
            Value      Std. Error t value    Pr(>|t|)
(Intercept)  130.89400    0.20609  635.12464    0.00000
f1            -2.55500    0.28139   -9.07995    0.00000
f2           -60.90500    0.21322 -285.64558    0.00000
f3           -29.42300    0.23409 -125.69092    0.00000
f1:f2         -2.77700    0.29052   -9.55870    0.00000
f1:f3          7.89700    0.33308   23.70870    0.00000
f2:f3         27.78100    0.24338  114.14722    0.00000
f1:f2:f3      -6.95800    0.34491  -20.17327    0.00000

Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 *
    f3 + f2 * f3 + f1 * f2 * f3, tau = c(0.25, 0.5,
    0.75, 0.9, 0.95, 0.99), data = data_stats)

tau: [1] 0.95

Coefficients:
            Value      Std. Error t value    Pr(>|t|)
(Intercept)  157.45900    0.42733  368.47413    0.00000
f1            -4.10200    0.55834   -7.34678    0.00000
f2           -81.24000    0.44012 -184.58697    0.00000
f3           -46.17500    0.46235  -99.87033    0.00000
f1:f2         -2.01700    0.57651   -3.49866    0.00047
f1:f3         15.67000    0.67409   23.24600    0.00000
f2:f3         43.00100    0.47973   89.63500    0.00000
f1:f2:f3     -14.05100    0.69737  -20.14843    0.00000

Call: rq(formula = output ~ f1 + f2 + f3 + f1 * f2 + f1 *
    f3 + f2 * f3 + f1 * f2 * f3, tau = c(0.25, 0.5,
    0.75, 0.9, 0.95, 0.99), data = data_stats)

tau: [1] 0.99

Coefficients:
            Value         Std. Error    t value       Pr(>|t|)
(Intercept)  2.544860e+02  3.878303e+07  1.000000e-05  9.999900e-01
f1          -1.420000e+01  5.917548e+11  0.000000e+00  1.000000e+00
f2          -1.582920e+02  3.450261e+07  0.000000e+00  1.000000e+00
f3          -1.139210e+02  4.763057e+07  0.000000e+00  1.000000e+00
f1:f2        5.725000e+00  1.324283e+12  0.000000e+00  1.000000e+00
f1:f3        6.811780e+02  1.153645e+13  0.000000e+00  1.000000e+00
f2:f3        1.042510e+02  2.299953e+24  0.000000e+00  1.000000e+00
f1:f2:f3    -6.763210e+02  2.299953e+24  0.000000e+00  1.000000e+00

Warning message:
In summary.rq(xi, ...) : 288000 non-positive fis
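
The "non-positive fis" warning is raised by the se = "nid" method of summary.rq(), which is the default for samples this large and which relies on local density estimates (the "fis"). One way to check whether the huge standard errors at tau = 0.99 are an artifact of that estimate is to request bootstrap standard errors instead. The following is only a minimal sketch on simulated data: the data frame `toy`, its 0/1 features and its exponential noise are assumptions standing in for data_stats, whose variable types are not shown in this thread. The formula f1 * f2 * f3 expands to the same eight terms as the formula in the Call above.

```r
library(quantreg)

set.seed(1)

# Hypothetical stand-in for data_stats: three 0/1 features, skewed response.
n <- 5000
toy <- data.frame(
  f1 = rbinom(n, 1, 0.5),
  f2 = rbinom(n, 1, 0.5),
  f3 = rbinom(n, 1, 0.5)
)
toy$output <- 70 - 20 * toy$f2 - 5 * toy$f3 + rexp(n, rate = 1 / 20)

# Fit only the extreme quantile that caused trouble.
fit99 <- rq(output ~ f1 * f2 * f3, tau = 0.99, data = toy)

# Standard errors based on local density estimates (default for large n).
s_nid <- summary(fit99, se = "nid")

# Bootstrap standard errors, which do not need the density estimate.
s_boot <- summary(fit99, se = "boot", R = 200)

print(s_nid$coefficients)
print(s_boot$coefficients)
```

If the bootstrap standard errors on the real data come out at a sensible magnitude while the "nid" ones do not, that would point at the density estimate in the far tail rather than at the subsampling itself.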

On Sat, Nov 15, 2014 at 8:19 PM, William Dunlap <wdunlap at tibco.com> wrote:

> You can time it yourself on increasingly large subsets of your data.  E.g.,
>
> > library(quantreg)
> > dat <- data.frame(x1=rnorm(1e6), x2=rnorm(1e6),
> +                   x3=sample(c("A","B","C"),size=1e6,replace=TRUE))
> > dat$y <- with(dat, x1 + 2*(x3=="B")*x2 + rnorm(1e6))
> > t <- vapply(n<-4^(3:10), FUN=function(n){d<-dat[seq_len(n),];
> +    print(system.time(rq(data=d, y ~ x1 + x2*x3, tau=0.9)))},
> +    FUN.VALUE=numeric(5))
>    user  system elapsed
>       0       0       0
>    user  system elapsed
>       0       0       0
>    user  system elapsed
>    0.02    0.00    0.01
>    user  system elapsed
>    0.01    0.00    0.02
>    user  system elapsed
>    0.10    0.00    0.11
>    user  system elapsed
>    1.09    0.00    1.10
>    user  system elapsed
>   13.05    0.02   13.07
>    user  system elapsed
>  273.30    0.11  273.74
> > t
>            [,1] [,2] [,3] [,4] [,5] [,6]  [,7]   [,8]
> user.self     0    0 0.02 0.01 0.10 1.09 13.05 273.30
> sys.self      0    0 0.00 0.00 0.00 0.00  0.02   0.11
> elapsed       0    0 0.01 0.02 0.11 1.10 13.07 273.74
> user.child   NA   NA   NA   NA   NA   NA    NA     NA
> sys.child    NA   NA   NA   NA   NA   NA    NA     NA
>
> Do some regressions on t["elapsed",] as a function of n and predict up to
> n=10^7.  E.g.,
> > summary(lm(t["elapsed",] ~ poly(n,4)))
>
> Call:
> lm(formula = t["elapsed", ] ~ poly(n, 4))
>
> Residuals:
>          1          2          3          4          5          6          7          8
> -2.375e-03 -2.970e-03  4.484e-03  1.674e-03 -8.723e-04  6.096e-05 -9.199e-07  2.715e-09
>
> Coefficients:
>              Estimate Std. Error  t value Pr(>|t|)
> (Intercept) 3.601e+01  1.261e-03 28564.33 9.46e-14 ***
> poly(n, 4)1 2.493e+02  3.565e-03 69917.04 6.45e-15 ***
> poly(n, 4)2 5.093e+01  3.565e-03 14284.61 7.57e-13 ***
> poly(n, 4)3 1.158e+00  3.565e-03   324.83 6.43e-08 ***
> poly(n, 4)4 4.392e-02  3.565e-03    12.32  0.00115 **
> ---
> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
> Residual standard error: 0.003565 on 3 degrees of freedom
> Multiple R-squared:      1,     Adjusted R-squared:      1
> F-statistic: 1.273e+09 on 4 and 3 DF,  p-value: 3.575e-14
>
>
> It does not look good for n=10^7.
>
>
>
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com
>
> On Sat, Nov 15, 2014 at 12:12 PM, Yunqi Zhang <yqzhang at ucsd.edu> wrote:
>
>> Hi all,
>>
>> I'm using quantreg rq() to perform quantile regression on a large data
>> set. Each record has 4 fields and there are about 18 million records in
>> total. I wonder if anyone has tried rq() on a large dataset and how long
>> I should expect it to take. Or is it simply too large, so that I should
>> subsample the data? I would like to have an idea before I start a run and
>> wait forever.
>>
>> In addition, I would appreciate it if anyone could give me an idea of
>> roughly how long rq() takes for a given dataset size.
>>
>> Yunqi

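The timings in Bill's quoted output can also be extrapolated with a power-law fit on the log-log scale, as an alternative to the poly(n, 4) regression he shows. This is a rough sketch rather than his method: the 0.1-second cutoff for discarding runs too short to time reliably is an assumption, and the target sizes are simply the subsample, 10^7, and the full data set mentioned in the thread.

```r
# Subset sizes and elapsed seconds copied from the quoted timing output.
n       <- 4^(3:10)
elapsed <- c(0, 0, 0.01, 0.02, 0.11, 1.10, 13.07, 273.74)

# Drop runs too short to be timed reliably (assumed cutoff: 0.1 s).
keep   <- elapsed >= 0.1
timing <- data.frame(n = n[keep], elapsed = elapsed[keep])

# Power-law model: log(elapsed) = a + b * log(n).
fit   <- lm(log(elapsed) ~ log(n), data = timing)
slope <- unname(coef(fit)[2])

# Extrapolate to the sizes discussed in the thread.
target <- data.frame(n = c(1.8e6, 1e7, 1.8e7))
target$predicted_sec <- exp(predict(fit, newdata = target))

print(slope)
print(target)
```

The fitted slope is the estimated exponent of n in the running time. Because the step-to-step growth in the quoted timings increases with n, a single power law fitted this way is more likely to understate than overstate the time for the largest sizes.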